UNIVERSIDAD NACIONAL AUTÓNOMA DE MÉXICO PROGRAMA DE DOCTORADO EN CIENCIAS BIOMÉDICAS INSTITUTO DE ECOLOGÍA Patrones macro y microevolutivos en calabazas (Cucurbita): ritmos de evolución genómica y patrones de genómica poblacional durante la domesticación TESIS QUE PARA OPTAR POR EL GRADO DE: DOCTOR EN CIENCIAS PRESENTA: JOSUÉ BARRERA REDONDO DIRECTOR DE TESIS DR. LUIS ENRIQUE EGUIARTE FRUNS INSTITUTO DE ECOLOGÍA, UNAM COMITÉ TUTOR DR. LUIS DAVID ALCARAZ PERAZA FACULTAD DE CIENCIAS, UNAM DR. ENRIQUE IBARRA LACLETTE INSTITUTO DE ECOLOGÍA A.C. (INECOL) CIUDAD UNIVERSITARIA, CDMX. AGOSTO 2020 UNAM – Dirección General de Bibliotecas Tesis Digitales Restricciones de uso DERECHOS RESERVADOS © PROHIBIDA SU REPRODUCCIÓN TOTAL O PARCIAL Todo el material contenido en esta tesis esta protegido por la Ley Federal del Derecho de Autor (LFDA) de los Estados Unidos Mexicanos (México). El uso de imágenes, fragmentos de videos, y demás material que sea objeto de protección de los derechos de autor, será exclusivamente para fines educativos e informativos y deberá citar la fuente donde la obtuvo mencionando el autor o autores. Cualquier uso distinto como el lucro, reproducción, edición o modificación, será perseguido y sancionado por el respectivo titular de los Derechos de Autor. El jurado de examen doctoral estuvo constituido por: Dr. Diego Cortez Quezada Presidente Dr. Luis Enrique Eguiarte Fruns Secretario Dra. Eria Rebollar Caudillo Vocal Dra. Alicia Mastretta Yanes Vocal Dra. Araxi Urrutia Odabachian Vocal AGRADECIMIENTOS Agradezco a la Universidad Nacional Autónoma de México (UNAM) por permitirme a mí y a miles de mexicanos realizar nuestros estudios superiores con la más alta calidad, de manera pública y gratuita. Gracias al Doctorado en Ciencias Biomédicas por permitirme realizar mis estudios de doctorado en el Instituto de Ecología de la UNAM y al Consejo Nacional de Ciencia y Tecnología (CONACyT) por brindarme los recursos económicos necesarios para mi manutención durante la realización de esta tesis (beca número 583146). Agradezco al Programa de Apoyo a los Estudios de Posgrado (PAEP) y al proyecto Ecos Nord M12-A03 “Genómica de poblaciones: estudios en el maíz silvestre, el teosinte (Zea mays ssp. parviglumis y Zea mays ssp. mexicana)” por el financiamiento otorgado para realizar estancias de investigación en México y en Francia, respectivamente. La investigación realizada en esta tesis fue financiada por la Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO) KE004 ‘‘Diversidad genética de las especies de Cucurbita en México e hibridación entre plantas genéticamente modificadas y especies silvestres de Cucurbita”, CONABIO PE001 ‘‘Diversidad genética de las especies de Cucurbita en México. Fase II. Genómica evolutiva y de poblaciones, recursos genéticos y domesticación’’, CONACyT Investigación Científica Básica 2011.167826 ‘‘Genómica de poblaciones: estudios en el maíz silvestre, el teosinte (Zea mays ssp. parviglumis y Zea mays ssp. mexicana)’’ y CONACyT Problemas Nacionales 247730. Durante el trascurso de esta tesis se realizaron análisis en clúster de cómputo de la CONABIO, el cual fue parcialmente financiado por la Secretaría de Medio Ambiente y Recursos Naturales (SEMARNAT) por medio del fondo ‘‘Contribución de la Biodiversidad para el Cambio Climático’’ otorgado a la CONABIO. Agradezco a Rodrigo García Herrera, jefe del Departamento de Cómputo Científico del Laboratorio Nacional de Ciencias de la Sostenibilidad (LANCIS), Instituto de Ecología, UNAM, por correr la infraestructura de computación de alto rendimiento del LANCIS con la que se realizaron parte de los análisis de esta tesis. Se reconoce el soporte técnico del MC. Emanuel Villafán y a los recursos de cómputo de alto rendimiento que el Instituto de Ecología A.C. (INECOL) hizo disponibles para realizar parte de los resultados descritos en esta tesis. Gracias a la Red de Estudios Moleculares Avanzados del INECOL por permitirme realizar diversas estancias en sus instalaciones y permitirme formar parte de su vida académica. Agradezco al Dr. Diego Cortez Quezada, la Dra. Eria Rebollar Caudillo, la Dra. Alicia Mastretta Yanes y la Dra. Araxi Urrutia Odabachian por sus valiosas observaciones y comentarios, los cuales mejoraron sustancialmente la calidad de esta tesis. Agradezco al Dr. Luis E. Eguiarte y a la Dra. Valeria Souza por permitirme realizar mis estudios de posgrado en su laboratorio, además de brindarme su apoyo académico y personal durante estos años. Agradezco al Dr. Rafael Lira por todas sus contribuciones y apoyo durante estos 5 años. Agradezco también al Dr. Daniel Piñero por todo el apoyo y la colaboración que hemos realizado estos años. Agradezco al Dr. Santiago Ramírez Barahona por haber sido mi tutor en la licenciatura y permitirme desarrollar mi potencial como investigador. Agradezco mucho a la Dra. Alejandra Vázquez Lobo por toda la ayuda y el apoyo que me brindó durante el inicio de mi doctorado, y por convencerme de realizar este proyecto. Agradezco a la Dra. Erika Aguirre Planter y a la Dra. Laura Espinosa Asuar por su ayuda y apoyo técnico. Agradezco a la Sra. Silvia Barrientos por su ayuda y por su amistad. Agradezco a los miembros del Laboratorio de Evolución Molecular y Experimental por formar parte de mi vida académica y personal durante estos 5 años. Le agradezco al Dr. Enrique Ibarra Laclette su apoyo y su amistad a lo largo de estos años, la cual quiero que dure toda la vida. Agradezco mucho a mi familia por brindarme su cariño y apoyo durante toda la vida, sé que cuento con ustedes y quiero que sepan que ustedes cuentan conmigo. Le agradezco a Jazmín muchas cosas: te agradezco por formar parte de mi vida, por soportarme y ayudarme en los momentos difíciles, por mejorar los momentos felices estando juntos, por las pláticas triviales y profundas, por darme tu cariño, por permitirme darte el mío y por hacerme crecer como persona. 1 INDICE RESUMEN 2 ABSTRACT 3 INTRODUCCIÓN 4 Evolución molecular y genómica 4 Patrones micro y macroevolutivos de evolución molecular 5 Los procesos de domesticación en plantas 6 El Género Cucurbita 7 Duplicación completa del genoma en Cucurbita 11 Domesticación en el género Cucurbita 13 OBJETIVOS 16 CAPÍTULO 1: ENFOQUES MODERNOS AL ANÁLISIS DE LA DOMESTICACIÓN 17 Artículo de revisión: Genomic, transcriptomic and epigenomic tools to study the domestication of plants and animals: a field guide for beginners 18 CAPÍTULO 2: ENSAMBLE DEL GENOMA DE REFERENCIA DE Cucurbita argyrosperma Y DINÁMICA EVOLUTIVA DE LAS FAMILIAS GÉNICAS CODIFICANTES Y NO CODIFICANTES EN EL GÉNERO Cucurbita 42 Artículo de investigación: The genome of Cucurbita argyrosperma (silver-seed gourd) reveals faster rates of protein- coding gene and long noncoding RNA turnover and neofunctionalization within Cucurbita 43 CAPÍTULO 3: ANÁLISIS GENÓMICO DE LA DOMESTICACIÓN DE Cucurbita argyrosperma 58 Artículo de investigación: The domestication patterns of Cucurbita argyrosperma are consistent with early migration events in Mesoamerica and general domestication syndromes in squashes 59 DISCUSIÓN 82 La duplicación completa del genoma como evento macroevolutivo: ¿Una cuestión de escala de tiempo o de propiedades emergentes? 83 Innovación evolutiva a escala molecular 89 El estudio de la domesticación a la luz de la genética de poblaciones 94 Los estudios de domesticación como herramienta biotecnológica 98 CONCLUSIONES 101 REFERENCIAS 102 ANEXO 1: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 1 113 ANEXO 2: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 2 116 ANEXO 3: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 3 141 2 RESUMEN El género Cucurbita (las calabazas) es un grupo de 13 especies de plantas que son nativas del continente americano. Son plantas herbáceas con una historia evolutiva que incluye una duplicación completa del genoma y al menos seis eventos independientes de domesticación. Estudiar los patrones de evolución molecular derivados de ambos procesos es útil para entender los procesos evolutivos a distintas escalas. Los resultados que se presentan en esta tesis se enfocan en la especie Cucurbita argyrosperma (la calabaza pipiana), un cultivo tradicional mexicano del cual se consume principalmente su semilla. La tesis presenta una revisión de los métodos modernos para estudiar el proceso de domesticación en plantas y animales. Se secuenció, ensambló y anotó el genoma de C. argyrosperma para compararlo con los genomas disponibles de otras especies del género Cucurbita y de la familia Cucurbitaceae. Esto con el objeto de comprender los cambios en el ritmo y modo de evolución molecular en las calabazas después del evento de duplicación completa del genoma. Usando datos de representación reducida del genoma de múltiples poblaciones, se identificaron patrones demográficos y los barridos selectivos que moldearon el genoma de esta especie durante su domesticación. Finalmente, se discute la importancia de los resultados obtenidos para entender la dinámica evolutiva de los genomas en las calabazas, sus aportaciones al entendimiento de la evolución molecular y la importancia teórica y práctica del estudio de la domesticación. 3 ABSTRACT The Cucurbita genus (Pumpkins, squashes and gourds) is a group of 13 plant species that are native to the American continent. They are herbaceous plants with an evolutionary history that includes a whole-genome duplication and at least six independent domestication events. Studying the patterns of molecular evolution derived from both processes is useful to understand the evolutionary processes at different timescales. The results presented in this thesis focus on the species Cucurbita argyrosperma (the silver-seed gourd), a Mexican crop that is cultivated mainly for seed consumption. The thesis first summarizes the modern methods used to study the domestication process of plants and animals. We sequenced, assembled and annotated the genome of C. argyrosperma and compared it with the available genomes of other Cucurbita and Cucurbitaceae species to understand the changes in the tempo and mode of molecular evolution of the Cucurbita genus after its whole-genome duplication. Additional reduced-representation sequencing data from multiple populations allowed us to reconstruct the demographic patterns and identify selective sweeps that shaped the genome of this species throughout its domestication process. Finally, we discuss the importance of our results to understand the evolutionary dynamics of Cucurbita genomes, their contribution to our understanding of molecular evolution and both the theoretical and practical importance of domestication studies. 4 INTRODUCCIÓN Evolución molecular y genómica Desde la publicación de “El origen de las especies” en 1859 hasta mediados del siglo pasado, la evolución había sido estudiada con base en caracteres morfológicos (Hedrick, 2011). Con los avances teóricos y tecnológicos realizados en biología molecular desde la década de los 60’s hasta el día de hoy, se han utilizado distintos marcadores moleculares para comprender los procesos evolutivos que moldean a los organismos a lo largo del tiempo (Schlötterer, 2004). Emile Zuckerkandl y Linus Pauling argumentaron en 1965 que las proteínas, el ADN y el ARN pueden ser vistos como documentos de la historia evolutiva de las especies (Zuckerkandl y Pauling, 1965). Posteriormente Richard Lewontin y John Hubby realizaron en 1966 los primeros estudios con variantes alélicas de enzimas (aloenzimas) en poblaciones naturales de Drosophila pseudoobscura, abriendo las puertas a una nueva forma de estudiar la genética de poblaciones (Lewontin y Hubby, 1966). Más adelante, Motoo Kimura desarrolla en 1968 la teoría neutral de la evolución, argumentando que el costo de selección para mantener la cantidad observada de variación era demasiado alto para que cualquier población mamífera pudiera soportarla, por lo que la mayor parte de las diferencias genéticas entre organismos debía ser moldeada principalmente por la mutación y la deriva génica (Kimura, 1968). Todos estos eventos cimentaron el desarrollo de una nueva disciplina: la evolución molecular (Schlötterer, 2004). La evolución molecular estudia los procesos y eventos históricos responsables tanto del origen de la variación genética como de los cambios en el material genético a lo largo del tiempo (Futuyma, 2005). Pronto aparecieron métodos que permitían analizar directamente la variación en el ADN, como los polimorfismos de longitud en fragmentos de restricción (RFLPs por sus siglas en inglés), los microsatélites o la secuenciación de ADN por el método de Sanger (Schlötterer, 2004). Recientemente se han desarrollado tecnologías de secuenciación masiva, las cuales permiten conocer y analizar los genomas completos de los organismos (Eguiarte et al., 2013). Dado que los costos de secuenciación son cada vez menores, los estudios genómicos ya no están limitados a organismos modelo, como Arabidopsis thaliana o Zea mays, sino que pueden 5 aplicarse a cualquier organismo de interés. El uso de estas nuevas tecnologías nos permite comprender el comportamiento en la totalidad del genoma de las poblaciones y de las especies, permitiéndonos estudiar procesos microevolutivos mediante la genómica de poblaciones, hasta procesos macroevolutivos mediante genómica comparada (Eguiarte et al., 2013). Patrones micro y macroevolutivos de evolución molecular La microevolución se suele definir como los patrones y cambios en la variación biológica a nivel poblacional, mientras que la macroevolución se suele atribuir a procesos observados a jerarquías taxonómicas superiores al de especie (Reznick y Ricklefs, 2009). El término macroevolución fue acuñado hace casi un siglo al pensarse que los cambios en los planes corporales no podían ser explicados por los cambios a escala poblacional propuestos originalmente por Charles Darwin (Philiptschenko, 1927). Hoy en día, la distinción entre la macroevolución y la microevolución suele ser ambigua e incluso controversial (Hautmann, 2020), ya que su distinción comúnmente se atribuye a la escala de tiempo en la que se observan los procesos microevolutivos que moldean a los organismos, mientras algunos autores sugieren que los patrones macroevolutivos requieren de procesos adicionales a los descritos por los estudios microevolutivos para poder ser explicados (Erwin, 2001), tales como los modelos de autoorganización (Kauffman, 1993) o las restricciones estructurales en los organismos (Webster y Goodwin, 1984). Independientemente de estas controversias, los procesos microevolutivos y macroevolutivos requieren de diferentes aproximaciones para poder ser analizadas a nivel molecular. Los procesos microevolutivos pueden ser analizados bajo el marco de la genética de poblaciones, una herramienta útil que nos permite comprender cómo las distintas fuerzas evolutivas han moldeado la variación genética observada en las poblaciones a una escala ecológica. La genómica de poblaciones consiste en la aplicación de la teoría de genética poblacional sobre datos de genomas completos, lo cual nos permite discernir la demografía histórica de las poblaciones, identificar 6 aquellas regiones que han sido sometidas a presiones de selección, conocer los tiempos de divergencia entre los organismos, detectar los patrones de introgresión y flujo genético entre poblaciones y encontrar asociaciones entre genotipos y fenotipos (Eguiarte et al., 2013). Esto resulta particularmente útil para entender los procesos de adaptación local, o en el caso de las plantas cultivadas, los procesos de domesticación. Por otro lado, los procesos macroevolutivos suelen analizarse bajo el marco de la biología comparada con un enfoque filogenético, ya sea mediante el estudio de fósiles, filogenias, el desarrollo embrionario o genomas completos (Reznick y Ricklefs, 2009). En este sentido, la genómica comparada, o sea el análisis comparativo de genomas completos, nos permite analizar procesos macroevolutivos de evolución molecular como el origen, el cambio, la ganancia y la pérdida de los genes y otros elementos funcionales en los genomas, los eventos de rearreglos cromosómicos entre distintos taxa, la dinámica evolutiva de los transposones en los genomas eucariontes, entender los cambios en el orden relativo de los genes en los cromosomas (i.e., la sintenia), entender las convergencias evolutivas a nivel molecular, o la relación entre la aparición de ciertos genes e innovaciones morfológicas y fisiológicas en los organismos (Xia, 2013). Los procesos de domesticación en plantas La domesticación puede ser definida como un proceso de evolución mutualista en la cual un organismo asume control sobre la reproducción y cuidado de otro organismo, de manera que pueda aprovechar algún recurso de interés (Zeder 2015). Las presiones selectivas ejercidas sobre los taxa domesticados suele favorecer la fijación de caracteres que son de interés para el domesticador, lo cual conlleva a la divergencia morfológica, fisiológica y molecular entre el taxón domesticado y su pariente silvestre (Purugganan y Fuller 2009; Zizumbo-Villarreal y Colunga-GarcíaMarín 2010). 7 La agricultura surgió a consecuencia de la domesticación de las plantas (Diamond, 2002). La agricultura es considerada una de las innovaciones tecnológicas más importantes de la humanidad, con la cual los grupos nómadas pasaron a formar grupos sedentarios hace 10,500 años, gracias a la producción confiable y regular de alimento (Diamond, 2002). En este sentido, la domesticación podría considerarse una relación mutualista altamente exitosa, dado que en la actualidad las plantas cultivadas se han propagado a lo largo del planeta, ocupando grandes extensiones de tierra, mientras que las civilizaciones humanas han logrado prosperar y expandirse (Purugganan y Fuller, 2009). Los organismos domesticados son sistemas útiles para comprender procesos evolutivos como la adaptación, la especiación, la coevolución y la evolución de los genomas (Hancock, 2005; Gepts, 2014); además de revelarnos información importante sobre caracteres de importancia agronómica en los taxa estudiados (Gustafson et al., 2008; Hufford et al., 2012). Estudios enfocados al análisis de la variación genética a nivel genómico han revelado loci candidatos a haber cambiado debido a las fuerzas de selección artificial durante la domesticación (Meyer y Purugganan, 2013; Gepts, 2014). Estos cambios genéticos explican las diferencias morfológicas y bioquímicas entre taxa domesticados y sus parientes silvestres (Qi et al., 2013; Shang et al., 2014). El grupo de interés en esta tesis son las calabazas, las cuales han pasado por múltiples eventos de domesticación (Castellanos-Morales et al., 2018). El género Cucurbita El género Cucurbita, comúnmente conocidos como calabazas en México o zapallos en Sudamérica, son un grupo de plantas herbáceas nativas del continente americano que presentan frutos pepónides (Lira et al., 2009) y que consta de 13 especies (Paris, 2016). Cucurbita es un sistema de estudio interesante, dado que presenta seis eventos independientes de domesticación, que en su mayoría se propone que sucedieron en México (Nee, 1990; Zheng et al., 2013; Castellanos- Morales et al., 2018). Además, Cucurbita es uno de los géneros con mayor variedad 8 morfológica en las angiospermas (Bisognin, 2002). La mayor parte de los taxa se distribuyen en México, con un origen relativamente reciente (16±7 Ma) desde su divergencia con su género hermano, Peponopsis (Schaefer et al., 2009). Estas plantas son de importancia alimenticia a escala nacional e internacional, dado que se consumen los frutos, las semillas, las flores y las partes “tiernas” de los tallos (Lira et al., 2009). Además, poseen compuestos químicos de importancia económica, como las saponinas, que se usan como detergente y “jabón de calabaza” (Nee, 1990) y la cucurbitacina, un compuesto secundario amargo característico de la familia Cucurbitaceae (Bisognin, 2002) que se concentra en los frutos de las calabazas y que posee propiedades antitumorales y antifúngicas (Tallamy et al., 2005; Lee et al., 2010). Uno de los clados del género Cucurbita [grupo Argyrosperma sensu Lira et al. (2009)] contiene a los taxa domesticados C. argyrosperma subsp. argyrosperma (la calabaza pipiana) y C. moschata (la calabaza de castilla), y al taxón silvestre C. argyrosperma subsp. sororia (Nee, 1990; Zheng et al., 2013), cuyas poblaciones se distribuyen por toda la costa del Pacífico en México y Centroamérica, desde Sonora hasta llegar a Nicaragua, con unas poblaciones aisladas en Veracruz (Lira et al., 2009). En particular, el taxón domesticado C. argyrosperma subsp. argyrosperma es interesante para el estudio de la domesticación, ya que se conoce su descendencia a partir de una población ancestral del taxón silvestre C. argyrosperma subsp. sororia (Sanjur et al., 2002; Sánchez-de la Vega et al., 2018). Adicionalmente, ambos taxa están cercanamente emparentados con el taxón domesticado C. moschata (Sanjur et al., 2002; Castellanos-Morales et al., 2018), del cual se desconocen los ancestros silvestres de los cuales pudo originarse (Nee, 1990). Los tres taxa son principalmente polinizados por abejas de los géneros Peponapis y Xenoglossa, siendo frecuente el entrecruzamiento ente el taxón silvestre y los dos taxa domesticados (Lira et al., 2009). Los registros arqueológicos más antiguos de C. argyrosperma subsp. argyrosperma datan de hace 8,700 años y fueron encontrados en el Valle Central del río Balsas en Guerrero (Piperno et al., 2009), mientras que los registros más antiguos de C. moschata se han encontrado 9 en el Valle de Tehuacán (Puebla) con 6900-5500 años de antigüedad (Whitaker, 1981). Sin embargo, el registro inequívoco más antiguo de calabazas cultivadas pertenece a Cucurbita pepo en la cueva de Guilá Naquitz (Oaxaca), con ~10,000 años de antigüedad (Smith, 1997). El cultivo de C. argyrosperma se ha enfocado al uso de sus semillas, con menor énfasis en el uso de sus flores, tallos y frutos (Paris, 2016), desencadenando múltiples diferencias morfológicas que la distinguen de su pariente silvestre (Fig. 1). Esto lo vuelve nuevamente un buen modelo de estudio para comprender los procesos de domesticación en calabazas, ya que se piensa que los primeros caracteres en ser seleccionados durante la domesticación de estas especies fueron las semillas (Lira et al., 2016). Por otro lado, C. moschata es aprovechada de manera más amplia, consumiéndose sus frutos maduros e inmaduros, semillas, tallos y flores como fuente de alimento, e incluso sus saponinas para la producción de jabón de calabaza (Paris, 2016). Esto sugiere que ambos taxa han pasado por presiones de selección similares en algunos rubros (e.g., tamaño de la semilla) y presiones distintas en otros (e.g., suculencia y sabor de la pulpa en el fruto). Cucurbita argyrosperma y C. moschata presentan, como casi todas las especies del género, 20 pares de cromosomas (Whitaker, 1933) y el tamaño estimado de sus genomas por citometría de flujo son de ~366 Mbp y ~416 Mbp, respectivamente (Bennett y Smith, 1976; Šiško et al., 2003). Se han realizado estudios de mapeo cromosómico en C. pepo (Zraidi et al., 2007; Esteras et al., 2012), C. moschata (Gustafson et al., 2008) y C. maxima (Zhang et al., 2015), los cuales han revelado un alto grado de sintenia entre especies, sugiriendo que el orden relativo de los genes en los cromosomas esta conservado dentro del género Cucurbita (Gong et al., 2008; Gustafson et al., 2008). Recientemente han avanzado los estudios genómicos enfocados en el género Cucurbita, y hasta el momento se han publicado genomas de referencia de alta calidad para las especies C. pepo (Montero-Pau et al., 2018), C. moschata (Sun et al., 2017) y C. maxima (Sun et al., 2017). También se han secuenciado los genomas de otras especies de la familia Cucurbitaceae con distinto número cromo- 10 Figura 1. Diferencias morfológicas entre Cucurbita argyrosperma subsp. argyrosperma (domesticada; derecha) y Cucurbita argyrosperma subsp. sororia (silvestre; izquierda). Se observan diferencias en (A) el tamaño y forma del fruto, (B) el tamaño y forma de la semilla, (C) y la presencia de tricomas urticantes. (D) Escenario de domesticación propuesto para C. argyrosperma. 11 -sómico cercanas a Cucurbita como el pepino (Cucumis sativus; 2n=14) (Huang et al., 2009), el melón (Cucumis melo; 2n=24) (Garcia-Mas et al., 2012), la sandía (Citrullus lanatus; 2n=22) (Guo et al., 2012), el guaje (Lagenaria siceraria; 2n=24) (Wu et al., 2017) y el melón amargo (Momordica charantia; 2n=22) (Urasaki et al., 2017). Duplicación completa del genoma en Cucurbita Desde hace décadas se ha sospechado de un origen alotetraploide (i,e., un organismo que adquiere poliploidía a partir de un evento de hibridización) del género Cucurbita (Sun et al., 2017). La sospecha inicial surgió a raíz de un estudio citológico entre especies del mismo género (Weiling, 1959) y por el hecho de que las especies del género Cucurbita presentan un número cromosómico mayor al de otros miembros de la familia Cucurbitaceae (Xie et al., 2019). Estudios citogenéticos y análisis de expresión genética con isoenzimas parecían apoyar la hipótesis de que el ancestro común del género Cucurbita había pasado por un evento de duplicación completa del genoma (Weeden, 1984; Singh, 1990). Un estudio más reciente utilizando un mapa genético de C. pepo basado en SNPs (i.e., polimorfismos de un solo nucleótido) encontró bloques sinténicos que se sobrelapaban entre distintos cromosomas de esta especie, sugiriendo nuevamente un grado de duplicación dentro de los genomas de Cucurbita (Esteras et al., 2012). No fue hasta que se generaron los primeros genomas de referencia para este género que se confirmó un evento de duplicación completa del genoma en Cucurbita, el cual se piensa que ocurrió hace 30± 4 millones de años (Sun et al., 2017; Montero-Pau et al., 2018). La confirmación de esta duplicación genómica se basa en tres resultados clave: A) la observación de macrosintenia entre cromosomas de Cucurbita y otras cucurbitáceas, observando relaciones de sintenia 2:1; B) la observación de un exceso de genes parálogos por familia génica en Cucurbita a comparación de otras cucurbitáceas C) y el hallazgo de una divergencia sincrónica entre dichos genes parálogos dentro de los genomas de calabazas 12 mediante un análisis de sitios sinónimos degenerados (Sun et al., 2017; Montero- Pau et al., 2018). Después de un evento de duplicación, los genomas suelen pasar por un proceso denominado fraccionamiento, el cual consiste en la perdida diferencial de genes y otros elementos no codificantes por parte de ambas copias del genoma, lo cual lleva a una diferenciación entre las dos copias y una diploidización (i.e., pasar de un estado poliploide a un estado diploide) del genoma (Fig. 2; Cheng et al., 2018). Durante la diploidización del genoma duplicado, el contenido de genes dentro del genoma resultante suele ser similar al de sus parientes no duplicados, aunque la arquitectura de dicho genoma cambie radicalmente (Cheng et al., 2018). A diferencia de otros eventos de duplicación derivados de una alotetraploidización, los genomas del género Cucurbita pasaron por un proceso de fraccionamiento equilibrado, es decir, que ambas copias del genoma perdieron genes de manera equitativa durante el proceso de fraccionamiento y diploidización del genoma (Sun et al., 2017), a diferencia de otros alotetraploides donde un genoma suele dominar sobre el otro (Cheng et al., 2018). A pesar de haber pasado por un fraccionamiento equilibrado, no se había estudiado el efecto que pudiera tener esta duplicación completa del genoma sobre la tasa de evolución de los genomas de calabazas, particularmente en familias de genes codificantes y no codificantes (Ponting et al., 2009; Magadum et al., 2013; Nelson y Shippen, 2015). Se ha especulado mucho sobre las consecuencias macroevolutivas que han tenido las duplicaciones completas del genoma sobre la historia evolutiva de las plantas, pero se han realizado pocos estudios empíricos que pongan a prueba dichas hipótesis (Clark y Donoghue, 2018). Dadas las complicaciones en el estudio evolutivo de elementos no codificantes en los genomas, la mayor parte de los estudios enfocados en duplicaciones completas del genoma se enfocan en los genes codificantes (Ulitsky, 2016; Nelson et al., 2017). No obstante, se ha observado que los elementos no codificantes, tales como los ARNs largos intergénicos no codificantes (lincRNAs) aceleran su tasa de recambio después de un evento de “disturbio genómico”, tales como rearreglos cromosómicos o duplicaciones genómicas (Nelson y Shippen, 2015). 13 Por lo tanto, el estudio del género Cucurbita bajo un enfoque de genómica comparada nos permite poner a prueba los posibles efectos que puedan tener una duplicación completa del genoma sobre la dinámica y origen de familias de genes codificantes y no codificantes. Figura 2. Esquema de los procesos de fraccionamiento y diploidización posteriores a una duplicación completa del genoma. Domesticación en el género Cucurbita La disponibilidad de recursos genómicos en calabazas nos permite investigar los procesos microevolutivos que han moldeado los genomas de estas plantas durante 14 su domesticación. Se han realizado estudios genómicos enfocados a la domesticación en cultivos americanos como el maíz (Hufford et al., 2012; Jiao et al., 2012; Romay et al., 2013), el frijol (Schmutz et al., 2014), el jitomate (Koenig et al., 2013; Lin et al., 2014) y el chile (Qin et al., 2014). También se han estudiado genómicamente aspectos de la domesticación en otros cultivos de importancia mundial como la soya (Lam et al., 2010; Li et al., 2013), el arroz (Xu et al., 2011) y el pepino (Qi et al., 2013; Shang et al., 2014). Las calabazas, a pesar de ser cultivos de alta importancia a nivel mundial (Lira et al., 2009), carecen hasta el momento de estudios a nivel genómico donde se analice su domesticación. Sin embargo, los estudios genómicos y fisiológicos realizados en otras cucurbitáceas sugieren que se han seleccionado caracteres similares en las cucurbitáceas domesticadas, como el aumento en el tamaño del fruto y una pérdida sistemática de la amargura en los frutos (Chomicki et al., 2019). La pérdida de la amargura en otras cucurbitáceas está ligada a presiones selectivas convergentes sobre el factor transcripcional Bt, el cual regula la actividad transcripcional de la cucurbitadienol sintasa, la primera enzima en la vía de biosíntesis de la cucurbitacina (Shang et al., 2014; Zhou et al., 2016). Sin embargo, existen pocos casos donde se hayan seleccionado caracteres asociados a la semilla en cucurbitáceas (Chomicki et al., 2019), volviendo a las calabazas, y particularmente a C. argyrosperma, un buen modelo para entender la domesticación asociada al consumo de semilla. Se han realizado estudios previos que elucidan la historia demográfica de las calabazas durante su domesticación, basándose en el uso de microsatélites y secuencias de cloroplasto (Sánchez-de la Vega et al., 2018; Castellanos-Morales et al., 2019). En ambos estudios, se encontró una diferenciación genética entre las subespecies domesticadas y sus parientes silvestres (Sánchez-de la Vega et al., 2018; Castellanos-Morales et al., 2019). En el caso de C. argyrosperma, se ha observado una variación genética similar en las poblaciones domesticadas y silvestres, además de evidencia de flujo genético entre ambos taxa (Sánchez-de la Vega et al., 2018). Adicionalmente, se ha propuesto con base en la distancia genética entre las poblaciones silvestres y domesticadas, que la región del Balsas 15 y Jalisco son posibles centros de domesticación para esta especie (Sánchez-de la Vega et al., 2018). A pesar de los esfuerzos realizados para comprender la domesticación de C. argyrosperma, los microsatélites solo pueden revelar la historia demográfica de la especie, por lo que se desconocen las regiones del genoma que se encuentran bajo selección durante su domesticación. El presente estudio consistió en generar un genoma de referencia de alta calidad para el taxón domesticado C. argyrosperma. subsp. argyrosperma y realizar análisis evolutivos a nivel genómico, comparando con su taxón hermano silvestre C. argyrosperma. subsp. sororia para dilucidar la historia demográfica y los patrones de selección entre ambas subespecies debido al proceso de domesticación. También se planteó realizar análisis comparativos con otros genomas del género Cucurbita, al igual que otros genomas de la familia Cucurbitaceae para describir el efecto que tuvo la duplicación completa del genoma (Sun et al., 2017; Montero-Pau et al., 2018) sobre la tasa de evolución de familias génicas codificantes y no codificantes en Cucurbita. 16 OBJETIVOS Los objetivos generales de esta tesis corresponden a los capítulos que la componen, en el siguiente orden: 1. Describir los métodos modernos con los cuales se puede analizar el proceso de domesticación, de manera que se pueda plantear un proyecto de investigación enfocado en la domesticación a nivel genómico. 2. Ensamblar y anotar un genoma de referencia de C. argyrosperma para analizarlo con un enfoque de genómica comparada y evaluar el efecto que tuvo la duplicación completa del genoma sobre la dinámica evolutiva de los genes codificantes y no codificantes en el género Cucurbita. 3. Realizar análisis de genómica poblacional entre la subespecie domesticada (C. argyrosperma subsp. argyrosperma) y la subespecie silvestre (C. argyrosperma subsp. sororia) para dilucidar los eventos demográficos y las presiones de selección que ocurrieron durante el proceso de domesticación de C. argyrosperma. 17 CAPÍTULO 1: ENFOQUES MODERNOS AL ANÁLISIS DE LA DOMESTICACIÓN Artículo de revisión: Genomic, transcriptomic and epigenomic tools to study the domestication of plants and animals: a field guide for beginners Este capítulo es una guía teórica y práctica para el estudio de la domesticación mediante los enfoques más recientes de la genómica, transcriptómica, epigenética y la edición de genomas. Dicha guía es útil para estudiantes e investigadores poco familiarizados con estos enfoques, ya que describe las bases teóricas en cada sección, explica tanto las aplicaciones como su relevancia al estudio de la domesticación y expone ejemplos de cómo se han usado estas herramientas de manera exitosa en otros artículos. El capítulo describe los pasos necesarios para obtener un genoma de referencia y explica las distintas técnicas de secuenciación disponibles para realizar genómica de poblaciones, enfocándose en las ventajas y desventajas de cada una. Se exploran los modelos teóricos actuales y las herramientas necesarias para entender la historia demográfica y los procesos de selección a los fueron sometidos los organismos domesticados. Se expone el análisis de pan-genomas en plantas y animales como un modelo útil para conocer las variantes estructurales responsables de los procesos de domesticación. Se aborda el análisis de ADN antiguo extraído de material arqueológico para resolver preguntas que no son posibles de abordar analizando a las poblaciones de la actualidad. También se describe el diseño experimental necesario para estudiar los cambios en los patrones de expresión transcriptómica derivados de los procesos de domesticación. Se abordan las distintas herramientas disponibles para comprender el papel de los mecanismos de regulación epigenética en los procesos de domesticación. Finalmente, se describe la utilidad de las herramientas de edición genética para validar experimentalmente los genes candidatos de la domesticación, además del potencial uso de estas herramientas para el mejoramiento genético y realizar domesticación de novo. Este capítulo fue publicado en la revista Frontiers in Genetics en el mes de julio del 2020. REVIEW published: 15 July 2020 doi: 10.3389/fgene.2020.00742 Edited by: TingFung Chan, The Chinese University of Hong Kong, China Reviewed by: Eric Von Wettberg, The University of Vermont, United States Martin Mascher, Leibniz Institute for Plant Genetics and Cultural Plant Research (IPK), Germany David Irwin, University of Toronto, Canada *Correspondence: Luis E. Eguiarte fruns@unam.mx Specialty section: This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics Received: 25 April 2020 Accepted: 22 June 2020 Published: 15 July 2020 Citation: Barrera-Redondo J, Piñero D and Eguiarte LE (2020) Genomic, Transcriptomic and Epigenomic Tools to Study the Domestication of Plants and Animals: A Field Guide for Beginners. Front. Genet. 11:742. doi: 10.3389/fgene.2020.00742 Genomic, Transcriptomic and Epigenomic Tools to Study the Domestication of Plants and Animals: A Field Guide for Beginners Josué Barrera-Redondo, Daniel Piñero and Luis E. Eguiarte* Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Mexico City, Mexico In the last decade, genomics and the related fields of transcriptomics and epigenomics have revolutionized the study of the domestication process in plants and animals, leading to new discoveries and new unresolved questions. Given that some domesticated taxa have been more studied than others, the extent of genomic data can range from vast to nonexistent, depending on the domesticated taxon of interest. This review is meant as a rough guide for students and academics that want to start a domestication research project using modern genomic tools, as well as for researchers already conducting domestication studies that are interested in following a genomic approach and looking for alternate strategies (cheaper or more efficient) and future directions. We summarize the theoretical and technical background needed to carry out domestication genomics, starting from the acquisition of a reference genome and genome assembly, to the sampling design for population genomics, paleogenomics, transcriptomics, epigenomics and experimental validation of domestication-related genes. We also describe some examples of the aforementioned approaches and the relevant discoveries they made to understand the domestication of the studied taxa. Keywords: population genomics, pangenomics, ancient DNA, differential expression analysis, epialleles, genome editing INTRODUCTION The modern study of domestication of plants and animals is multidisciplinary, and relevant contributions come from botany, zoology, archeology, genetics, ethnobiology, biogeography, and linguistics (Larson et al., 2014). Modern domestication studies seek to understand the dates of domestication, the places where domestication started and number of times that domestication took place, as well as the details of the evolutionary and ecological forces that led to the divergence between the domesticated taxa and their wild relatives and ancestors (Zeder, 2006; Larson et al., 2014). Given that domestication is an evolutionary process, genetics emerged as a powerful tool to understand the domestication of plants and animals, revealing the demographic history of the domesticated taxa and the genetic variants that underlie their domesticated phenotypes (Zeder et al., 2006; Gepts, 2014). The advent of high-throughput sequencing technologies sparked the use of genomic studies to understand the domestication of crops and animals in a much deeper Frontiers in Genetics | www.frontiersin.org 1 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication level than previously imagined, as researchers can now pinpoint the genetic changes that allowed domestication to happen (Ross-Ibarra et al., 2007; Gepts, 2014). WHY AND HOW TO USE A GENOMIC APPROACH IN DOMESTICATION STUDIES? TOP-DOWN AND BOTTOM-UP APPROACHES FOR THE STUDY OF DOMESTICATION In genetics, we refer to top or up when referring to a specific phenotype, while we refer to bottom or down when referring to the underlying genotype responsible for that trait. Thus, top-down approaches start by studying a particular phenotype and searching for its genetic basis. Huge advances in the genetic study of domestication traits have been made using classic top-down approaches (e.g., Sax, 1923; Paterson et al., 1988; Doebley and Stec, 1991; Doebley et al., 1995), which are performed by analyzing the phenotypic traits of interest between wild and domesticated taxa, and then finding the genetic variant or variants that correlate with the phenotypic traits through the mapping of quantitative trait loci and linkage disequilibrium (Ross-Ibarra et al., 2007; Kantar et al., 2017). These top-down approaches are precise in finding causal variants involved in the evolution of specific traits, but usually they are very labor-intensive and are biased towards a priori selected phenotypes to be compared between wild and domesticated taxa (Ross-Ibarra et al., 2007; Kantar et al., 2017). In contrast to top-down approaches, bottom-up approaches start by analyzing the genetic variation within genomes in order to detect potential signals of selection related to the domestication process and finally associate such evolutionary signals to important loci and domestication phenotypes (Ross- Ibarra et al., 2007; Kantar et al., 2017). In the last decade, high-throughput sequencing technologies allowed us to analyze entire genomes of one or several individuals of domesticated taxa, and to compare them to different varieties or to their wild relatives (e.g., Hufford et al., 2012; Yang et al., 2012; Li et al., 2013; Wang et al., 2019; Zeng et al., 2019). Bottom-up approaches do not need an a priori phenotypic target, enabling a genome-wide search of domestication-related loci without previous background of possible candidates, revealing important traits that can hardly be studied using a top- down approach (Ross-Ibarra et al., 2007; Kantar et al., 2017). Nevertheless, the results of bottom-up approaches can be limited by the sampling scheme, the density of genetic markers, and the detection of false positives (Tiffin and Ross-Ibarra, 2014), so these genomic approaches have to be properly and carefully designed in order to obtain satisfying results (De Mita et al., 2013). Genomic data facilitated the widespread and reliable use of bottom-up approaches to study plant and animal domestication, but top-down strategies were also aided by genomics, allowing a more efficient search of genotype-phenotype correlations through genome-wide association studies (GWAS; Wang G.- D. et al., 2014), which can be defined as experimental designs that are used to detect the association between genetic variation in a population and phenotypical traits of interest (Visscher et al., 2017). Genome-wide genetic markers allows to differentiate between global and local evolutionary signals occurring throughout the genome (Diao and Chen, 2012), discerning the signals of selection during domestication (Vitti et al., 2013) from other fine-scale signals of demographic events that occurred during the domestication process (Meyer and Purugganan, 2013; Guerra García and Piñero, 2017). The use of modern genomic tools is not limited to population genetics, as other interesting approaches can reveal important aspects of the domestication process. For instance, one can analyze changes in the transcriptional activity of genes related to domestication (Hekman et al., 2015), demonstrate the phenotypic effects of certain alleles through the use of genomic editing tools (Zhou J. et al., 2019), search for epigenetic patterns that changed between domesticated and wild taxa (Janowitz Koch et al., 2016) or analyze the genetic makeup of archeological samples (Irving-Pease et al., 2019). This review describes the necessary steps and data to start a genomic research project towards understanding domestication, the questions that can be approached using genomic data and the main results obtained from previous studies using these methods (Figure 1). WHOLE-GENOME ASSEMBLY AND REFERENCE GENOMES Whole-genome assembly is one of the first steps in modern domestication studies, since it generates a reference genome that is useful for downstream analyses. Whole-genome assembly projects require the use of high-throughput sequencing technologies such as Illumina (e.g., Sun et al., 2017), PacBio (e.g., Badouin et al., 2017; VanBuren et al., 2018), Oxford Nanopore (e.g., Belser et al., 2018) or a combination of these (e.g., Bickhart et al., 2017; Zhou Y. et al., 2019) to sequence the genome of interest of a single individual. Before starting a genome assembly project, a rough estimate of the haploid genome size must be known as well as the ploidy of the organism, since the assembly difficulty and sequencing cost are determined by both factors (Sims et al., 2014). In order to successfully assemble eukaryotic genomes, where repetitive elements usually comprise a significant portion of its content [ranging from 3% in tiny genomes such as Utricularia gibba (Ibarra-Laclette et al., 2013) up to 65.5% in huge genomes such as Ambystoma mexicanum (Nowoshilow et al., 2018)], it is necessary to generate sequencing libraries with large insert sizes – called mate-pair libraries – or use long-read sequencing technologies such as PacBio or Oxford Nanopore (Levy and Myers, 2016; Sohn and Nam, 2016). Additionally, the use of chromosome conformation capture (Mascher et al., 2017), optical mapping (Dong et al., 2013) or linkage maps obtained from crosses (Fierst, 2015) will help achieve chromosome-level assemblies that are highly desirable to adequately assess haplotypes, linkage disequilibrium, putative genomic rearrangements and the Frontiers in Genetics | www.frontiersin.org 2 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication FIGURE 1 | Proposed workflows to study different problems related to the domestication of plants and animals through genomic, transcriptomic and epigenomic tools. genomic location of candidate loci (Sohn and Nam, 2016; see Table 1). After sequencing and assembling the genome of at least one individual, it must be properly annotated before it can be of any use. Since eukaryotic genes are structurally complex, genome assemblies require the additional sequencing of RNA data from the same species to be used as transcriptomic evidence, alongside homology evidence from other curated genomes and ab initio predictions based on the underlying structure of genes, in order to be successfully annotated (Yandell and Ence, 2012; seeTable 1). Even though whole-genome assembly projects were previously restricted to large research groups (e.g., Schnable et al., 2009; Tomato Genome Consortium, 2012), the sequencing cost per nucleotide is declining constantly in all the aforementioned technologies, making genome analyses accessible for a large part of the research community (Muir et al., 2016). The current bottleneck for small research groups is usually not the cost of sequencing itself, but rather the availability of computational resources capable of storing and analyzing huge amounts of data (Muir et al., 2016). The main purpose of assembling a genome in a domestication study is to use it as a reference for high-quality population data to infer the selection, introgression and recombination processes, and to design posterior studies for experimental validation of candidate loci. Even though several population-level analyses based on reduced-representation genome sequencing can be performed in the absence of a reference genome (De Wit et al., 2012; Mastretta-Yanes et al., 2015), the use of a reference genome alongside population data enables the correct identification of otherwise anonymous loci into specific genes or regions within the genome and it makes possible the identification and the proper handling of linkage between loci (Fitz-Gibbon et al., 2017). Also, it can help to discriminate between orthologous and paralogous loci, which is critical given the large size of many genomes and the frequent genome duplication processes experienced during the evolution of plant and animal lineages (Clark and Donoghue, 2018; Zadesenets and Rubtsov, 2018). Thus, the availability of a reference genome is desired for genomic analyses concerning domestication. Luckily, domesticated taxa are usually economically relevant, drawing the attention of several research groups worldwide and in some cases helping to fund the projects. Therefore, reference genomes are usually available for domesticated species, since such data is also relevant for other research areas, such as crop improvement and breeding programs (Ellegren, 2014). However, it should be noted that using a single reference genome can lead to reference bias, where sequenced individuals that are more distantly related to the reference will tend to have fewer predicted variants due to mismatches while mapping the reads (Günther and Nettelblad, 2019). Besides its use as a reference genome for population-level data, the analysis of several whole-genome assemblies between Frontiers in Genetics | www.frontiersin.org 3 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication TABLE 1 | List of key publications with other reviews that are focused on specific topics, as well as some notable examples of research articles using some of the methods described in this review with reliable results. Topic Citation Usefulness/importance Genome assembly and reference genomes Sohn and Nam, 2016 In-depth review on genome assembly. Includes compelling explanations behind the genome assembly algorithms and an extensive list of genome assembly strategies. Yandell and Ence, 2012 In-depth review on eukaryotic genome annotation, a description of the available tools to predict genes and best practices when predicting genes. Sequencing strategies Meirmans, 2015 Classic review concerning common pitfalls that should be avoided in a population genomic study. A compulsory review for any newcomer to population genomics. Dorant et al., 2019 Empirical study that compares the efficiency of Pool-seq, RADseq and Rapture to detect weak signals of genetic structure in lobsters. Inbar et al., 2020 Empirical study that compares the efficiency of whole-genome sequencing, Pool-seq and RADseq for GWAS in ants. Pan-genomics Golicz et al., 2016a In-depth review about pan-genomics in plant species, its advantages over the use of reference genomes, a guide on how to generate pan-genomes and the importance of studying structural variants. The article is dedicated to plants, but the rationale and methods can also be applied to other eukaryotes. Khan et al., 2020 Opinion article detailing the relevance of pan-genomes as a necessary next step from reference genomes. The authors also highlight the importance of including wild taxa into pan-genomics and propose the idea of genus-level super-pan-genomes. Gao et al., 2019 Landmark study of the tomato pan-genome. The authors sequenced 725 accessions from the domesticated tomato and its wild relatives. They found 4,873 additional genes, including several well-characterized genes that were absent from the reference-genome. They also evaluated the presence-absence variants between the wild and domesticated tomatoes, which were enriched in disease-resistance genes. Demographic analyses Linck and Battey, 2019 Research study focused on the effects of minor allele-frequency filters to detect genetic structure in populations. Gives a good explanation on the rationale behind the clustering-based methods to detect structure. Mather et al., 2020 Review dedicated to the theoretical background and technical requirements of PSMS and MSMC to infer changes in effective population sizes and coalescent times. Gerbault et al., 2014 Excellent review on how to use Bayesian approaches to test different demographic models of domestication. Frantz et al., 2015 Landmark study on pig domestication. The authors make use of Approximate Bayesian computation to compare domestication scenarios, they use clustering-based methods to detect genetic structure and used a graph-based method to infer the genetic relationship between pig and wild boar populations. Selection scans Vitti et al., 2013 Good review focused on explaining the rationale behind many of the bottom-up tests to detect selection and the genomic signals they are sensitive to. De Mita et al., 2013; Lotterhos and Whitlock, 2015 Classic simulation-based studies that compare different scenarios to evaluate the best sampling strategies and the most powerful methods to detect selection throughout the genome, according to the reproductive nature of the organism under study. Gibson, 2018 A primer dedicated to understanding the principles behind GWAS and its ability to detect polygenic effects on quantitative traits. Hufford et al., 2012 A landmark paper that illustrates how to perform genome scans to detect domestication-related loci in domesticated taxa, and the importance of these loci for crop improvement. The paper studies the domestication of maize, but a similar study design can be applied to domesticated animals. Paleogenomics Irving-Pease et al., 2019 Exhaustive book chapter dedicated to the study of ancient DNA to understand domestication. Allaby et al., 2019 Research study that casts into doubt the long-lasting idea that domestication processes lead to strong population bottlenecks by re-analyzing data based on ancient DNA samples. Daly et al., 2018 Remarkable study that sequenced and analyzed 83 mitochondrial genomes and 51 nuclear genomes from ancient goat samples. The authors found signals of ancient introgression events, as well as ancient selective signals related to several traits that are shared with modern goats. Transcriptomics Fang and Cui, 2011 General guideline on how to adequately design an RNA-seq experiment to avoid technical mistakes and generate meaningful results. Yang and Kim, 2015 General guideline on how to analyze RNA-seq data to assess differential expression. Hekman et al., 2015 In-depth review dedicated to study the domestication process through transcriptomics, including methodological strategies and challenges. Hradilová et al., 2017 An excellent study that combines transcriptomic data with metabolomic data and morphological data between domesticated and wild peas. The analysis of multi-omic data allowed them to get a better understanding behind seed dormancy and pod dehiscence in domesticated peas. Epigenomics Guerrero-Bosagna, 2012; Heard and Martienssen, 2014; Burggren, 2016 Contrasting views on the role of transgenerational epigenetic inheritance in evolution. The topic is still debated and should be viewed critically. Jensen, 2015 In-depth review on the rationale and advances of epigenetic studies to understand domestication. The manuscript is focused on animal behavior, but many of the ideas can also be applied to domesticated plants. (Continued) Frontiers in Genetics | www.frontiersin.org 4 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication TABLE 1 | Continued Topic Citation Usefulness/importance Janowitz Koch et al., 2016 A landmark paper showing the importance of epigenetic marks on dog domestication and its association with behavioral traits. The study doesn’t just compare the methylation marks between wolves and dogs, but also assess the heritability of the methylation marks and proposes a formal test to detect selection on epialleles. Genome-editing tools Boettcher and McManus, 2015). Review on novel genome-editing techniques and RNA interference. Useful to compare and choose the best tool to validate candidate loci. Shan et al., 2020 A general guide on how to develop a CRISPR/Cas9 system on a non-model plant species. Soyk et al., 2017 Landmark paper that uses genome-editing to validate two candidate genes related to fruit size and reduced fruit dropping in tomato. The authors also detect the emergence of undesirable traits in domesticated tomatoes due to an epistatic effect between both domesticated loci and introduce wild alleles to generate new tomato phenotypes with reduced degrees of the undesirable traits. Perspectives Piperno, 2017 Review centered on the potential application of an extended synthesis framework to understand domestication. Centered around the concepts of niche construction, transgenerational epigenetic inheritance and developmental plasticity. domesticated and wild taxa will help us reveal structural differences between the genome of a domesticated taxon and its closest wild relatives, such as duplications, chromosome rearrangements or presence/absence of entire genes and genomic regions (Yang et al., 2012; Wang W. et al., 2014; Xie et al., 2019). Since selection and bottlenecks during domestication often leads to the fixation of mutations that involve a loss of function (Renaut and Rieseberg, 2015; Moyers et al., 2018), comparative analyses using genome assemblies of wild ancestors may also reveal these changes in genes that could not be properly predicted within the domesticated genome (Moyers et al., 2018). In this sense, further efforts should be made to assemble high-quality genomes of wild relatives alongside the domesticated taxon of interest (Brozynska et al., 2016; Xie et al., 2019). STRATEGIES TO GATHER ADEQUATE POPULATION GENOMICS DATA Genome assemblies alone give us a limited view on domestication, unless several genomes of wild relatives (if known and available) and domesticated individuals are sequenced, because evolution is a population-level process, and in consequence population data is necessary to address most of the evolutionary questions in domestication (Wang G.-D. et al., 2014; Guerra García and Piñero, 2017). Population genomics examines the genetic variation within and between populations that is scattered across the entire genome to assess the demographic history, phylogenetic relations and selective pressures of a species (Jorde, 2001). Several types of genomic data can be evaluated at the population level, including single nucleotide polymorphisms (SNPs), indels and copy number variations; but SNPs are the most commonly analyzed of the three (Seal et al., 2014). All population-level sequencing techniques share common pitfalls that should be known and avoided before investing any money on sequencing. Population sampling should be planned carefully, as the sampling scheme has a stronger impact over sequencing to obtain reliable results in any analysis (Meirmans, 2015). Also, different populations should be mixed, rather than being sequenced on separate libraries or sequencing lanes, as failing to do so will generate sequencing biases that can be confused with biological patterns (Meirmans, 2015; see Table 1). Once adequate genomic population data is gathered, we need to analyze the demographic processes that shaped the genetic variation and the population structure of contemporary populations during the domestication process. This data is necessary to perform tests to detect natural and artificial selection, which are required to understand the genetic base of domestication syndromes (Ross-Ibarra et al., 2007). There are several approaches to obtain population data at a genomic scale, which differ in the fraction of the genome that is sequenced, therefore determining the sequencing cost of each sample (Schreiber et al., 2018). Whole-Genome Sequencing of Populations After assembling a reference genome, one of the next possible strategies to understand domestication is to sequence the complete genome of several individuals. This approach requires the alignment of the sequencing reads back to a reference genome, in order to infer the variable sites between individuals and know the genetic elements (e.g., genes, upstream regulators, repetitive elements, non-coding RNAs) associated to those sites. The main benefit of this approach is its potential to retrieve all the variant sites within an individual’s genome that are structurally represented in the reference genome.Whole-genome sequencing can be used in almost any population-level test of interest (Schreiber et al., 2018). Common practices recommend a sequencing depth around 30× per individual, but empirical studies in pigs suggests that even 10x is enough to cover up to 99% of a genome with accurate detection of variant sites (Jiang L.G. et al., 2019). The main drawback of this approach is the sequencing cost of each sample, which is significantly higher compared with other approaches, especially for organisms with large genomes such as polyploid crops or mammals (Schreiber et al., 2018). This can lead researchers to evaluate a trade-off between sequencing depth and number of sampled individuals to optimize their resources. Simulation studies suggest that sequencing more individuals is more convenient to obtain Frontiers in Genetics | www.frontiersin.org 5 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication reliable results, even at the expense of lower sequencing depths per individual (Fumagalli, 2013). Alternatives to Whole-Genome Sequencing Other approaches aim to reduce the sequencing cost per samples by pooling the DNA of several individuals into a single sequencing library (Futschik and Schlötterer, 2010) or by reducing the portion of the genome that is sequenced (often named as reduced-representation sequencing), either by sequencing arbitrary defined segments scattered across the genome, by targeting the desired portions of the genome or by sequencing the transcriptionally active portions of the genome (Schreiber et al., 2018). These techniques are especially helpful for organisms with very large genomes, and some of these methods can even be used in the absence of a reference genome (Mastretta-Yanes et al., 2015; Schreiber et al., 2018). Furthermore, the reduced representation of the genome means that those fewer regions that are targeted can have a high sequencing depth, leading to higher accuracy of the observed genetic variation and better heterozygosity estimations (Schreiber et al., 2018). Additionally, the reduced sequencing cost per sample allows for a large number of sequenced individuals and populations that, with a proper sampling strategy, can lead to robust results (De Mita et al., 2013; Lotterhos and Whitlock, 2015). Due to the fragmented nature of these sequencing techniques, reduced representation data alonemay be insufficient to pinpoint all or even the most important possible causal genetic variants associated to the domestication syndromes (Lowry et al., 2017), but they are still useful to infer basic genetic statistics, infer demographic properties and past demographic scenarios, detect some signatures of selective sweeps across the genome and even perform GWAS for domestication traits of interest (Andrews et al., 2016; Schreiber et al., 2018). Pool Sequencing Pool sequencing (Pool-seq) is a promising alternative to whole- genome sequencing with a much lower cost (Futschik and Schlötterer, 2010). As the name suggests, Pool-seq consists of sequencing a large pool of individuals for a given population into a single high-throughput sequencing library, instead of sequencing each individual separately, allowing an accurate estimation of allele frequencies and other parameters of population genetics at the expense of losing individual-level information (Futschik and Schlötterer, 2010). This method requires to map the reads against a reference genome of the same species in order to work (Schlötterer et al., 2014). It is intended for sequencing large pools of individuals (>40 individuals per population is recommended, but >100 is optimal), otherwise the allele frequencies will not be estimated accurately (Schlötterer et al., 2014). The relative amount of pooled DNA of each individual in a Pool-seq study should be similar in order to avoid overrepresentation of individual alleles, a task that is often challenging (Schlötterer et al., 2014). Pool-seq has several limitations that should be considered based on the objectives of the research project. It is difficult to discard a low-frequency allele from a sequencing error, but this problem is potentially fixed by either establishing a minor allele frequency threshold for SNP calling or by using pool replicates (Schlötterer et al., 2014). One important limitation is the inability of Pool-seq data to estimate linkage disequilibrium and haplotype phasing, which is particularly important to evaluate the non-independence of genetic signals in demographic studies and selective scans (Schlötterer et al., 2014). Finally, assessing genetic structure can be difficult and sometimes misleading when using Pool-seq, due to potential biases in individual allele representations within the pool (Dorant et al., 2019). This makes Pool-seq an adequate method for GWAS, selective sweeps and some methods based on allele frequencies when resources are limited (Luu et al., 2017; Inbar et al., 2020), but the loss of individual-level information makes many of the demographic inferences difficult, as populations need to be predefined before sequencing (Dorant et al., 2019), and the bioinformatic tools that handle Pool-seq data are scarce. Exome Capture and Sequencing Exome sequencing is another lower-cost alternative to whole- genome sequencing which targets the protein-coding regions of the genome (Warr et al., 2015; Kaur andGaikwad, 2017). Protein- coding genes represent a small fraction of eukaryotic genomes, which is particularly useful for most population genomic studies, since it represents mostly functional elements within genomes (Kaur and Gaikwad, 2017). This technique is usually performed using hybridization probes, which requires previous knowledge of the genome content as well as a priori selection of regions of interest in order to design probes (Kaur and Gaikwad, 2017). Fortunately, hybridization probes are already available for several domesticated plants and animals (Warr et al., 2015; Kaur and Gaikwad, 2017). Despite its advantages, exome sequencing can generate an uneven sequencing depth in certain genomic positions, unlike whole-genome sequencing that shows a uniform distribution of reads throughout the genome (Lelieveld et al., 2015). Another important limitation of exome sequencing is its bias towards the protein-coding portion of the genome, since increasing evidence shows that many of the genetic changes that have been directly associated to domestication traits are located within cis- regulatory elements, noncoding RNAs and other trans-regulatory elements, rather than within the open reading frame of the genes (Swinnen et al., 2016). Despite its limitations, demographic history and selective sweeps can still be detected using this sequencing method (Pankin et al., 2018). RNA Sequencing of Populations Transcriptome sequencing (also known as RNA-seq) is another useful approach to obtain population-level data from the transcriptionally active elements within genomes (De Wit et al., 2012). RNA-seq can be mapped against a reference genome to detect genetic variants and determine the genomic regions of interest, but it can also be analyzed in the absence of a reference genome (De Wit et al., 2012), since transcriptomes can be assembled de novo (Haas et al., 2013) and the functional Frontiers in Genetics | www.frontiersin.org 6 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication annotation of the assembled transcripts is relatively easy (Bryant et al., 2017). However, transcription profiles are dependent on the sequenced tissues and organs, the development stage of the organism, and the influence of external stimuli, capturing just the transcripts that are active at the moment of RNA extraction (Hekman et al., 2015). This complexity can generate important biases in the relative abundance of certain transcripts over others and overlook potential adaptative genes whose expression are context dependent (Hekman et al., 2015; Kaur and Gaikwad, 2017). Nonetheless, RNA-seq is still a good option for species with large genomes that are hard to assemble (De Wit et al., 2012). Similarly to exome-sequencing, RNA-seq data can be used to evaluate demographic history and selective sweeps, but the selective signals are restricted to the transcriptionally active part of the genome, and cannot be used to evaluate structural variants (Schreiber et al., 2018). Restriction Site-Associated DNA Sequencing Restriction site-associated DNA sequencing (RAD-seq), which may also be referred to as genotyping by sequencing (GBS), has been one of the most popular options for cost-affordable population genomics in the last decade (Davey and Blaxter, 2010). The technique consists in using restriction enzymes to digest the DNA and sequence the regions adjacent to the restriction sites that are scattered across the genome (Davey and Blaxter, 2010). It can also be combined with sequence capture techniques to target specific loci of interest (Ali et al., 2016). RADseq data can either be mapped against a reference genome or it can be assembled de novo (Catchen et al., 2013; Mastretta-Yanes et al., 2015), making it a versatile technique for species with scant genomic resources. However, empirical studies show that using certain de novo approaches for RAD-seq data can lead to fewer predicted SNPs due to errors in the definition of loci and treatment of sequencing errors (Shafer et al., 2017), all which may subsequently alter downstream analyses, especially those based on the distribution of allele frequencies within the genome of a population, also known as the site frequency spectrum (SFS) (Shafer et al., 2017). For this reason, a reference-based approach is highly recommended as long as the reference genome is closely related to the population dataset (Shafer et al., 2017). Furthermore, RADseq data could involve errors when a polymorphism resides within a restriction site, which prevents the enzyme to cut in individuals carrying such polymorphism, leading to failures in sequencing that region in homozygous individuals (null alleles) and makes heterozygous individuals to look like homozygotes (allele dropout) (Andrews et al., 2016). Finally, the capacity of RADseq libraries to adequately perform selective scans has been casted into serious question (Lowry et al., 2017). Its potential capacity to detect selective sweeps is dependent on the genome size, the density of variants detected for a given genomic region and specially the length of the extent of linkage disequilibrium in the genome (Lowry et al., 2017). Thus, when a species genome has short regions in linkage disequilibrium (due to high recombination rates) and the SNP density is low (particularly in large genomes), odds are that the selective scans will likely miss a significant portion of selective sweeps associated to domestication (Lowry et al., 2017). PAN-GENOME ANALYSES IN DOMESTICATED AND WILD TAXA An increasing number of studies are revealing that structural variants (copy-number variation, presence/absence of genomic regions, inversions, transversions, translocations) are common within plant and animal populations (Khan et al., 2020). Thus, the use of a single reference genome hampers our ability to study the full repertoire of genetic variation within a species (Golicz et al., 2016a; Zhao et al., 2018). Structural variants such copy-number variation can contain functional genomic elements that are usually under relaxed selective pressures and can serve as the basis of adaptation given specific environments and selective regimes (Lye and Purugganan, 2019). Coincidentally, copy-number variation and other structural variants play an important role in the emergence of domestication traits, as well as diversification traits in landrace varieties (Lye and Purugganan, 2019). Some studies estimate that at least one third of the known domestication loci are structural variants, and up to one in seven genes can be hemizygous (i.e., with one copy) in grapevine individuals (Zhou Y. et al., 2019). Despite its importance, structural variants cannot be properly analyzed using any of the aforementioned techniques. This led the research community to adopt the concept of the pan-genome, an idea that first appeared in microbiology (Tettelin et al., 2005), into the study of plant and animal genomes (Golicz et al., 2016a). The concept of pan-genome rests on the idea that the genomes of individuals within a population or species share a core set of genes that unifies them (i.e., the core genome), but also contain a fraction of genes that are absent from one or more individuals (i.e., the accessory or dispensable genome), which altogether give rise to the pan-genome of such population or species (Tettelin et al., 2005). There are three main methods to generate a pan-genome: the alignment and comparison of multiple de novo genome assemblies, the iterative assembly of several genomes from an initial reference or the use of de Bruijn graph assemblers to jointly assemble several genomes (Golicz et al., 2016a; see Table 1). Since domestication reduces the genetic diversity of a taxon, often eliminating portions of the dispensable genome that contain genes involved in local adaptation, the use of wild relatives is crucial to generate a representative pan-genome for a species (Khan et al., 2020). Once a pan-genome is generated, it can be used alongside whole-genome sequencing data to analyze the structural variants between and within populations, revealing novel loci involved in the development of domestication-related traits that would have stayed hiddenwhen using a single reference genome (Li et al., 2014; Zhao et al., 2018). Besides, the use of a pan-genome alleviates the inherent reference biases of a single reference genome (Günther and Nettelblad, 2019). Pan-genome studies have revealed additional selective sweeps and structural variants associated to the domestication process, Frontiers in Genetics | www.frontiersin.org 7 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication which were not identified using sequencing data with a single reference genome (Li et al., 2014; Zhao et al., 2018). Pan- genomes are already available for several species (Figure 2) such as maize (Brohammer et al., 2018), wheat (Montenegro et al., 2017), Brassica oleracea (Golicz et al., 2016b) or Brassica napus (Hurgobin et al., 2018); and pan-genome analyses to study domestication have already been performed in soybean (Li et al., 2014), rice (Zhao et al., 2018), sunflower (Hübner et al., 2019) and tomato (Gao et al., 2019). While current eukaryote pan-genome analyses are focused on plant species (Golicz et al., 2016a, see Table 1) and goats (Li et al., 2019), other livestock researchers may soon venture into this field. As sequencing technologies become cheaper, multiple pan-genomes from different species of the same genus should eventually be combined to create a super- pan-genome that represents the entire genetic content available in a genus with one or more domesticated taxa, as it would include the diversity of all their wild relatives (Khan et al., 2020). POPULATION GENETICS AND DEMOGRAPHIC ANALYSES OF THE DOMESTICATION PROCESS Demography and population size changes during the domestication process is tightly related to unraveling some of the most fundamental questions of the domestication process. These analyses can help answer questions such as possible centers of origin and diversification, patterns of migration and expansion throughout these centers, gene flow between domesticated and wild taxa, number of domestication events, the extent of genetic erosion in the domesticated taxon, levels of global genetic differentiation between wild and domesticated taxa, the patterns of adaptive and neutral introgression among them, and in some cases even the number of generations that have elapsed since domestication and other processes such as differentiation and local adaptation of domesticated taxa (Meyer and Purugganan, 2013; Guerra García and Piñero, 2017). Genetic Diversity in Populations A first necessary step for the SNP data is to extract and compare the summary statistics of population genetics within and between populations (Andrews et al., 2016). This information describes the genetic diversity in populations, including the estimate of allele frequencies (usually denoted as p or the frequency of the most abundant allele), observed heterozygosity (HO), expected heterozygosity (HE), nucleotide diversity (π), number of segregating sites (S) and number of private alleles (i.e., alleles only found in one population). These summary statistics can reveal the level of genetic erosion in domesticated plants and animals when compared to the ancestral wild population, which is expected due to severe bottlenecks, selective sweeps and inbreeding (Groeneveld et al., 2010; Gepts, 2014). One should be aware that reference bias can influence the relative genetic variation observed between the wild and domesticated populations, which could be alleviated using more than one reference or using a pan-genome (Günther and Nettelblad, 2019). Population Structure It is also important to describe the population structure (i.e., the genetic differentiation among populations) of domesticated taxa and of their wild relatives, as it can reveal the influence of historical events that shaped the genetic diversity of the organisms (Linck and Battey, 2019). The level of population structure between wild and domesticated taxa can be determined by several factors, such as the number of generations since domestication started, the intensity of the selective pressures imposed to the domesticated taxon, the intensity of the bottlenecks suffered though the domestication process, and the frequency of gene flow between the domesticated taxon and its wild relative (Meyer and Purugganan, 2013). The F-statistics are classic estimates of population genetics that are based on the heterozygosity values within and among populations, which can reveal patterns of inbreeding, gene flow and differentiation between and within populations (Andrews et al., 2016). Of these, the FST statistic is of particular interest, since it can be used to detect population structure between wild and domesticated populations, or between different domesticated varieties (Andrews et al., 2016). These estimates are relatively simple to calculate, but they require a priori assignment of individuals to discrete populations, which may be wrongly assigned, may not reflect natural populations or may simply be unknown (Linck and Battey, 2019). Methods based on population clustering have become popular for describing genetic structure, as they do not require a priori population assignment. These clustering methods can be classified into parametric and non-parametric methods (Linck and Battey, 2019). Parametric methods, also known as model-based methods, assign individuals into a predefined number of K populations based on their genotypes and the allele frequency of each locus (Pritchard et al., 2000). Several parametric methods have been described that successfully analyze genomic datasets to infer population structure (e.g., Tang et al., 2006; Alexander et al., 2009; Raj et al., 2014), but one has to be careful when using them, as they assume linkage equilibrium and Hardy-Weinberg equilibrium in the dataset (Linck and Battey, 2019), so SNPs should be filtered accordingly before these methods can be confidently used (Wigginton et al., 2005; Mathew et al., 2018). Furthermore, parametric methods have been found to be susceptible to changes in the SFS generated by minor allele frequency thresholds that are commonly used to filter population genomics data because low-frequency polymorphisms are expected to contain information about recent events, which adds uncertainty to the assignation of individuals in populations that reflect ancient demographic events (Linck and Battey, 2019). Non-parametric methods include principal component analyses, discriminant analyses of principal components and K-means clustering. These methods define populations and genetic structure by transforming the genetic data into uncorrelated variables – named eigenvectors or principal components – to identify groups within the dataset (Patterson et al., 2006; Jombart et al., 2010; Linck and Battey, 2019). Non- parametric methods were designed to work with large amounts Frontiers in Genetics | www.frontiersin.org 8 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication FIGURE 2 | Examples of three pan-genomes of domesticated taxa. The citations included correspond to the publications of the original reference genomes and the subsequent pan-genome assemblies. The inner circles on the right represent the content of the reference genome, while the outer circles represent the additional nonreference content retrieved with the pan-genome. Some examples of important genes are show for the nonreference part of each pan-genome. (images obtained from Openclipart). of genomic data (Patterson et al., 2006; Jombart et al., 2010) and they are more robust to changes in the SFS than the parametric methods, so it is recommended to run both types of methods and compare their results before making further inferences (Linck and Battey, 2019). Inferences in Changes of Population Sizes Throughout Time An important aspect of the demographic history of domesticated taxa is the analysis of the change in the effective population size (Ne) in the populations throughout time (Chen J. et al., 2018). The concept of Ne reflects the estimated populations size in a Wright-Fisher model given an observed genetic variation, so these estimations hardly reflect the census population size of real populations (Charlesworth, 2009), and can also be affected by reference biases and allele dropouts. Changes in Ne can reveal or at least hint on the demographic history of taxa throughout the domestication process, such as expansions or bottlenecks. These changes can help to understand other evolutionary aspects of domestication concerning natural and artificial selection, such as the efficiency of selection and the accumulation of deleterious mutations in domesticated taxa (Chen J. et al., 2018; Allaby et al., 2019). The domestication process is expected to include a bottleneck as a consequence of subsampling the genetic diversity in the wild ancestor, followed by a population expansion as domesticated taxa diversify (Meyer and Purugganan, 2013), although this idea has been recently challenged by paleogenomic studies (Allaby et al., 2019). Many methods exist to explore the changes in Ne throughout time, whose approach sometimes depends on the type of data available. It should be noted that all the methods to infer historical changes in Ne are susceptible to predicting false bottlenecks when populations are structured, so as indicated above, genetic structure should be evaluated and properly accounted for (Nielsen and Beaumont, 2009). Studies with few individuals and high sequencing depth may use the Pairwise Sequentially Markovian Coalescent model (PSMC; Li and Durbin, 2011) or the Multiple Sequential Markovian Coalescent model (MSMC; Schiffels and Durbin, 2014) to analyze the demographic history of domesticated and wild taxa. The PSMC and MSMC models can infer changes in Ne throughout time (bottlenecks and expansions) by calculating the distribution of the time of coalescence between all the heterozygous loci in complete diploid genomes (Li and Durbin, 2011; Schiffels and Durbin, 2014). These models can also calculate the time of coalescence (i.e., separation, and in some cases the domestication time) between two genomes given a Frontiers in Genetics | www.frontiersin.org 9 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication specified mutation rate, recombination rate and generation time (Li and Durbin, 2011). However, the genomes used in PSMC or MSMC must be of very good quality, having an average sequencing depth of the very least 18x, at least 10 reads per site, and less than 25% of missing data (Nadachowska-Brzyska et al., 2016). Besides, PSMC has several limitations compared to other estimators of Ne and is particularly susceptible to predicting false bottlenecks when populations are structured (Mazet et al., 2015). Nevertheless, this can be properly handled by comparing models of instantaneous Ne size change against models of classical symmetric islands using a maximum-likelihood approach (Mazet et al., 2015). Multiple Sequential Markovian Coalescent can infer more recent changes in Ne compared to PSMC (Schiffels and Durbin, 2014), so it may be convenient to explore recent demographic expansions in diversified domesticated taxa (Allaby et al., 2019). For example, MSMC was used to infer population bottlenecks in East Asian and Western Eurasian dogs, as well as divergence times between wolves and dogs around 60,000–20,000 years ago (Frantz et al., 2016), while PSMC was used to determine a severe bottleneck in African rice around 15,000–13,000 years ago (Meyer et al., 2016). Other methods rely on population data at a genomic scale from many (sometimes hundreds) individuals (as obtained from exome sequencing or RAD-seq), namely the extended Bayesian skyline plots (Heled and Drummond, 2008; Trucchi et al., 2014) and the stairway plots (Liu and Fu, 2015). Since Ne is a crucial concept in coalescent theory, extended Bayesian skyline plots and stairway plots rely on the SFS calculated from the population data to estimate Ne (Heled and Drummond, 2008; Liu and Fu, 2015). The inferences made from these two methods are comparable to those obtained from PSMC and MSMC, although they rely on different kinds of datasets (Liu and Fu, 2015). Furthermore, stairway plots are more efficient in inferring recent demographic history, whereas PSMC is more reliable for ancient demographic events (Liu and Fu, 2015). Estimating Gene Flow and Introgression Between Populations Ancient gene flow and local ancestry (i.e., the genetic ancestry of an individual for an specific chromosomal position; Thornton and Bermejo, 2014) are also important aspects of plant and animal domestication that need to be addressed, since they can describe the genetic contribution of different ancestral populations in the genomic architecture of extant populations, such as wild and domesticated taxa (Price et al., 2009; Pickrell and Pritchard, 2012). One approach to assess ancient gene flow are graph-based methods that incorporate the possibility of ancient gene flow between distantly related populations (Pickrell and Pritchard, 2012). This type of methods represents the relationships between populations as a bifurcating tree, where internal nodes can also be interconnected forming a graph that represents ancient gene flow that contributed to modern genetic variation (Pickrell and Pritchard, 2012). For example, graph-based analyses have revealed constant gene flow between sympatric populations of domesticated and wild pearl millet (Burgarella et al., 2018), constant gene flow between domesticated and wild pigs (Frantz et al., 2015) but lack of hybridization events between wild and domesticated populations of goats and sheep (Alberto et al., 2018). Another popular test to infer ancient admixture is the ABBA- BABA test, also known as the D-statistic, which evaluates the allelic patterns of three taxa and compares them to an outgroup to identify genomic regions with an excess of shared derived variants that are not concordant to the species tree (i.e., ABBA- BABA patterns), which suggest introgression events (Durand et al., 2011). The f∧ d test, which is derived from theD-statistic, can help discriminate between introgression events and nonrandom mating in ancestral structured populations (Martin et al., 2015). The D-statistic is sensitive to both introgression and incomplete lineage sorting, so both signals can be separated by testing deviations in the symmetry of branch lengths between the gene trees and the species tree (Edelman et al., 2019). By the same logic, the D3 test can also infer introgression events by analyzing the symmetry in branch lengths, without the need for an outgroup (Hahn and Hibbins, 2019). The D-statistic has been used to infer several introgression events between species of the Bos genus during domestication (Wu et al., 2018). On the other hand, local ancestry methods can reveal which chromosomal segments in the genome were inherited from different ancestral source populations (Price et al., 2009). These methods use the data obtained from linkage disequilibrium between loci to assign ancestry in each portion of the genome in comparison to reference populations that depict ancestral source populations, requiring an a priori assignation of unadmixed reference populations in order to assign local ancestry to the populations of interest (Price et al., 2009). The analysis reveals chromosomic blocks that can be assigned to either a wild or a domesticated ancestry in hybrid populations, which may reveal historical processes of introgression and local adaptation in modern domesticated populations, as well as potential targets for selective breeding (Janzen et al., 2019). Many methods exist that can infer local ancestry using genome-wide population data, and all of them require a high-quality reference genome (preferably assembled at a chromosome-level) in order to detect the ancestry of chromosomal segments (e.g., Price et al., 2009; Baran et al., 2012; Maples et al., 2013; Dias-Alves et al., 2018). For example, a local ancestry analysis of East Asian domestic cattle revealed introgressed blocks inherited from ancient banteng and yak populations that contained genes enriched in sensory perception of smell, transmembrane transport and antigen processing (Chen N. et al., 2018). Using Demographic Simulations to Infer Domestication Scenarios The previous descriptive tools can help us explore possible evolutionary and demographic scenarios in the absence of a priori hypotheses (Liu and Fu, 2015). However, for domesticated taxa we usually have additional classic botanical, zoological, Frontiers in Genetics | www.frontiersin.org 10 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication morphological, paleoclimatic, archeological, ethnobiological and biogeographical data that may suggest some likely scenarios (Gerbault et al., 2014). Thus, demographic modeling can be used to test explicit demographic scenarios by comparing simulations of SFS in such scenarios to the observed data (Gerbault et al., 2014; Liu and Fu, 2015). There are many methods available for demographic modeling, which can be more suitable depending on the type of scenarios that need to be tested (Anderson et al., 2005; Gutenkunst et al., 2009; Excoffier and Foll, 2011; Cornuet et al., 2014). All these methods rely on some basic tenets of coalescent theory (Liu and Fu, 2015), so they are also susceptible to possible biases in the observed genetic variation in the populations. For example, the approximate Bayesian computation (ABC) method compares the summary statistics of several simulated scenarios against the observed data to accept or reject certain demographic hypotheses (Cornuet et al., 2014; Gerbault et al., 2014). This method can help us determine certain parameters of our models and can be used with genome-wide datasets (Cornuet et al., 2014). Other methods based on diffusion approximation can help us infer the demographic history of multiple populations and their interaction through migration and admixture using biallelic SNP data (Gutenkunst et al., 2009). Demographic modeling has helped test the number of domestication events as well as intercontinental migratory events in cattle (Pitt et al., 2019). Coalescent simulations have supported a common origin for all the domesticated varieties of pearl millet (Burgarella et al., 2018), while the ABC method has revealed that the most likely scenario in the domestication of the scarlet runner bean consists of a single domestication event around 21,000 years ago with a mild bottleneck effect (Guerra-García et al., 2017). IDENTIFYING GENES UNDER SELECTION DURING DOMESTICATION Demographic processes are important to understand the general history that led to the domestication of plant and animal taxa, but many studies are specially interested in finding the selected genes that explain the phenotypic differences between domesticated taxa and their wild counterparts (Wang G.-D. et al., 2014; Kantar et al., 2017). Indeed, the detection of these genes under selection during domestication is critical to understand the genetic basis of domestication syndromes, especially for detecting genetic variation relevant for future improvement and selective breeding (Hufford et al., 2012). When a genetic variant increases its frequency due to positive selection (i.e., selection favoring the fixation of a new allele), the adjacent alleles (i.e., physically connected in the same chromosomal region) also increase their frequency in a process known as hitchhiking (Smith and Haigh, 2007). Once the genetic variant under selection reaches a high frequency or fixation, the hitchhiking effect reduces or even eliminates the genetic variation around the selected locus, producing what is known as a selective sweep (Vitti et al., 2013; Pavlidis and Alachiotis, 2017). The size and intensity of a selective sweep depends on the rate of recombination in the genome, and on the intensity of the selective pressure (Smith and Haigh, 2007), which may be weaker in conscious selection compared to some cases of natural selection (Fugère and Hendry, 2018; Yang et al., 2019). Luckily, the signals of a selective sweep can be detected when the selection event occurred “recently” in an evolutionary timescale, as it is the case for domestication (Vitti et al., 2013). Different bottom-upmethods using population genomics data have been developed to detect the regions in the genome that were selected for during domestication, which we will refer to as candidate loci. We can mention methods for detecting regions with higher population differentiation compared to the rest of the genome, methods for detecting local changes in the SFS throughout the genome, and methods that detect extended regions with strong linkage disequilibrium compared to other haplotypes in the genome (see Supplementary Table S1 for a summary of methods to detect selective sweeps). Alternatively, a GWAS can be performed to detect the association of a genetic variant to a specific phenotype of interest (Wang G.-D. et al., 2014). FST Outlier Tests to Detect Candidate Genes Besides the standard use of FST to detect global population structure, the FST statistic can also be used to detect signals of selective sweeps between populations, namely between wild and domesticated taxa (Gepts, 2014). While a global FST statistic (involving all the analyzed loci or SNPs) can reveal the overall genetic structure between populations, a local FST statistic calculated for each locus or SNP along the genome can evaluate whether particular regions of the genome are more differentiated from what is expected due to demographic processes, which can be interpreted as signals of a selective sweep (Nei and Maruyama, 1975). Many different methods exist that are based on the FST statistic, which are collectively known as FST outlier tests (Foll and Gaggiotti, 2008; Excoffier et al., 2009; Bonhomme et al., 2010; de Villemereuil and Gaggiotti, 2015; Lotterhos and Whitlock, 2015), that differ mainly on the underlying model used to calculate the null distribution of the FST values, and thus its ability to detect outliers (Supplementary Table S1). FST outlier tests are able to detect selective pressures following a bottom-up approach, but their efficiency is determined by a multitude of factors that should be carefully accounted for before using them, such as the sampling scheme used to obtain the population data, the total size of the dataset (i.e., number of populations, of individual per population and of SNPs analyzed), the intensity of the selective pressure, the selfing or allogamous nature of its sexual reproduction, and the migration patterns and genetic structure among populations (De Mita et al., 2013; Lotterhos and Whitlock, 2014, 2015). Some successful examples in the use of FST outlier tests include the detection of domestication candidate genes in apple involved in fruit development, size, acidity and sugar metabolism (Khan et al., 2014), the finding of candidate domestication genes involved in metabolism and oil biosynthesis in sunflower Frontiers in Genetics | www.frontiersin.org 11 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication (Baute et al., 2015), the description of candidate diversification genes between pig breeds associated to the shape of the skull (Wilkinson et al., 2013), and the identification of candidate loci between wild and domesticated salmon strains involved in body weight, condition factor, male maturation and a brain related protein (Vasemägi et al., 2012). Site Frequency Spectrum Based Tests to Detect Selective Sweeps Selective sweeps alter the SFS that would be expected under neutral evolution processes because of the reduction in the genetic diversity around the loci under selection (Vitti et al., 2013). The genomic region under selection skews the SFS into an excess of high frequency derived alleles when the selective sweep was recent, since the alleles that were linked to the favored selected locus also reach high frequencies (Fay and Wu, 2000). However, after all the high-frequency alleles reached fixation, the genomic region under the selective sweep will have little to no variation, while mutations will slowly generate new allelic variants, skewing the SFS into an excess of low frequency variants (Zeng et al., 2006). Several tests have been developed to detect skews in the SFS, each of them capable of detecting changes in different parts of the SFS (Supplementary Table S1), making them complementary to one another (Zeng et al., 2006; Vitti et al., 2013). Even though SFS based tests are powerful tools to detect selection, it is important to remember that the SFS at the global genomic scale is also altered by demographic events such as bottlenecks that produces an excess of low frequency variants, and expansions that generates an excess of intermediate frequency variants (Vitti et al., 2013). Thus, it is mandatory to have a previous prediction of the demographic history of the populations in order to properly adjust the null hypothesis in each test (Ross-Ibarra et al., 2007). The well-known summary statistic called Tajima’s D is sensitive to changes in low-frequency variants, making it particularly useful to detect selective sweeps before and after the selected locus reaches fixation, although low-frequency variants can also be observed in loci under purifying selection (Tajima, 1989; Zeng et al., 2006). Tajima’s D is also sensitive to intermediate-frequency alleles, making it useful to detect balancing selection (Tajima, 1989) or even some forms of soft selective sweeps generated by standing genetic variation (Prezeworski et al., 2005). Conversely, Fay and Wu’s H is sensitive to changes in high- frequency variants, which are only altered by positive selection, making it very useful when used alongside Tajima’s D (Fay and Wu, 2000). Unlike Tajima’sD, Fay andWu’sH needs an outgroup species in order to differentiate ancestral alleles from derived alleles and thereby to know whether the derived alleles are at high or low frequencies (Fay and Wu, 2000). Zeng et al. (2006)’s E is sensitive to both low and high frequency variants, making it particularly powerful to detect selective sweeps before or after the selected locus reached fixation, also needing an outgroup in order to differentiate derived alleles from ancestral alleles). There are some tools available to implement SFS based tests using genome-wide data, that can perform all the above tests (i.e., Korneliussen et al., 2013, 2014; Rozas et al., 2017). For example, Tajima’s D test was used alongside other methods to detect selective sweeps associated to the domestication of yaks (Qiu et al., 2015), Zeng’s E test helped discover 125 selective sweeps associated to the domestication of horses (Librado et al., 2016), and the complementary implementation of Tajima’s D, Fay and Wu’s H and Zeng’s E revealed several candidate genes that share similar functions between peach and almond (Velasco et al., 2016). The reduction of diversity (ROD) test is another popular SFS- basedmethod to detect selective sweeps that has been particularly useful for the study of domestication (Supplementary Table S1). ROD compares local π values of domesticated taxa against the local π values of its wild relatives, using sliding windows alongside the genome (Guo et al., 2012; Huang et al., 2012; Qi et al., 2013; Schmutz et al., 2014). The ROD method has been used to successfully detect candidate domestication genes in rice (Huang et al., 2012), watermelon (Guo et al., 2012), cucumber (Qi et al., 2013), common bean (Schmutz et al., 2014), and chickpea (Varshney et al., 2019), to name a few. Linkage Disequilibrium (LD) Based Methods to Detect Selection Given that selective sweeps remove the variation in regions adjacent to the locus under selection, they can form haplotype blocks that extend in strong LD compared to other haplotypes in the same locus because they reached amedium-to-high frequency in the population swift enough so they are not yet disrupted by recombination (Sabeti et al., 2002; Vitti et al., 2013). This pattern has been exploited to develop several methods based on LD to detect selective sweeps of recent origin (Vitti et al., 2013). Interestingly, LD-based methods are sensitive enough to detect both strong and soft selective sweeps (Garud et al., 2015), as well as partial or incomplete selective sweeps (Vitti et al., 2013), making them excellent tools to study recent and ongoing selection events, such as those occurring during domestication and the subsequent diversification of landraces (Supplementary Table S1). Since the above rationale relies on LD decay due to recombination, any method based on LD requires to control for local variation in recombination rates in order to reduce false positives (Sabeti et al., 2002). The extended haplotype homozygosity (EHH) is a widely used statistic in LD-based methods that is defined as the probability that two orthologous genomic regions carrying a “core” haplotype of interest (i.e., the part of the haplotype that is shared by all the individuals carrying it, such as the allele under positive selection) in the population are identical by descent (i.e., they were inherited by the same ancestor), as one looks to a specified distance farther away from the core region (Sabeti et al., 2002). Among the LD based methods that uses the EHH, we can mention the long-range haplotype (LRH) test, sometimes named the relative EHH (rEHH) test, which controls for local Frontiers in Genetics | www.frontiersin.org 12 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication recombination rates by comparing the EHH of several haplotypes localized within the same locus (Sabeti et al., 2002). Other EHH based methods include the whole-genome long-range haplotype (WGLRH) test that uses sliding windows to perform the LRH test (Zhang et al., 2006), the long-range haplotype similarity (LRHs) test (Hanchard et al., 2006), the integrated haplotype score (iHS) which is particularly sensitive to incomplete selective sweeps and soft sweeps (Voight et al., 2006) and the cross-population extended haplotype homozygosity (XP-EHH) statistic that is able to detect selective sweeps after the selected allele reached fixation (Sabeti et al., 2007). The iHS and the XP-EHH statistics can be regarded as complementary to each other, enabling the detection of incomplete and complete selective sweeps in the target population (Vitti et al., 2013). All the LD-based tests that make use of the EHH statistic require the previous phasing of the chromosomes in order to work (i.e., assignation of alleles in an individual to their corresponding maternal and paternal haplotypes), which may or may not be possible depending on the sequencing depth and type of data available for the analysis (Delaneau et al., 2013). For instance, a reference genome is usually needed in order to phase genotypes, since most methods rely on the information of proximity between alleles and their distribution within individuals in a population to assign haplotypes (Delaneau et al., 2013) although new methods are emerging that can phase genotypes without a reference genome (Money et al., 2017). There are other LD-based methods that do not make use of the EHH statistic, such as the LD decay (LDD) test, which rely on individuals that are homozygous for any given SNPs to look for LD differences between alleles in a population (Wang et al., 2006) or the ω statistic that scans for high SNP correlation coefficients around a site under selection (Kim and Nielsen, 2004; Alachiotis et al., 2012). Another method that do not require chromosome phasing is the regression-based test, which relies on the reduction of heterozygosity as one approaches the locus under selection in a genome to infer selective sweeps (Wiener and Pong-Wong, 2011). Other LD-based methods exploit the estimation of identity-by- descent using genome-wide data to detect haplotypes that are shared between several unrelated individual (> 10 generations) to infer selective sweeps without previous knowledge of the pedigree of individual (Han and Abney, 2013), so they might prove useful to study recent domestication processes. Some examples of LD-based methods used to explore the domestication process includes an analysis using LRH to detect signatures of selection associated to dairy and beef cattle breeds (Bomba et al., 2015), a study using the XP-EHH statistic to find signals of selective sweeps in Jinhua pigs (Li et al., 2016), and a paper focused on the diversification of goat landraces that calculated the iHS and the XP-EHH statistics alongside other tests to detect selective sweeps between goat breeds (Bertolini et al., 2018). Other important tests include the XP-CLR test (Chen et al., 2010) and the µ statistic (Alachiotis and Pavlidis, 2018) which implement multiple signatures to detect selective sweeps (Supplementary Table S1) and have been used to detect candidate loci in maize and African rice, respectively (Hufford et al., 2012; Ndjiondjop et al., 2019). Using GWAS to Detect Domestication-Associated Loci Genome-wide association studies have been used extensively to uncover the genetic variants that underlie domestication traits (Shi and Lai, 2015). The domestication traits that can be analyzed through a GWAS can encompass any biological characteristic from simple morphological traits (Jiao et al., 2012) to the production of certain metabolites (Shang et al., 2014), tame behavior in animals (Ilska et al., 2017), resistance or susceptibility to certain diseases (Wang et al., 2012), or adaptation to certain environmental conditions (Song et al., 2018). An important advantage of the GWAS over the bottom-up approaches is its ability to detect polygenic effects on single traits of interest, which is commonplace considering that genes interact between them and the environment to generate phenotypes (Gibson, 2018). A prerequisite before preforming a GWAS is to have large sample sizes in both the number of sequenced genetic variants and the number of individuals included in the study, as they are necessary to obtain the statistical power to detect variants with small effects and to reduce the risk of false positives (Wang G.-D. et al., 2014). Some recent examples include the use of a GWAS to identify candidate genes with unknown functions involved in several agronomic traits, including drought and heat tolerance in chickpea (Varshney et al., 2019); a GWAS that revealed loci associated to fruit size and quality in peach (Cao et al., 2019); and a GWAS that uncovered the genetic variants involved in the absence of anthocyanin in domesticated rice compared to its wild relative (Zheng et al., 2019). ANCIENT DNA AND PALEOGENOMICS OF DOMESTICATED TAXA Extant domesticated taxa lack the information of ancient genetic diversity that was lost through bottlenecks, selection and genetic drift (Ramos-Madrigal et al., 2016). However, the analysis of ancient DNA can allow the research community to overcome some of these limitations (Irving-Pease et al., 2019). Ancient DNA retrieved from archeological sites allows the study of the rate at which domestication happened, as well as revealing which genes were important at the beginning of this process (Vallebueno-Estrada et al., 2016; Irving-Pease et al., 2019). Thus, paleogenomics is becoming a novel research area for understanding the process of plant and animal domestication (Irving-Pease et al., 2019). Extraction and Sequencing of Ancient DNA An important limitation of paleogenomic analyses is the level of preservation of the ancient DNA itself, as well as the total yield of extracted DNA (Sawyer et al., 2012). The DNAmolecules that are extracted from tissues that are not conserved on permafrost and are older than 100 years are usually shorter than 100 bp (Sawyer et al., 2012). The strand breaks of Frontiers in Genetics | www.frontiersin.org 13 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication such fragments are also non-random, as purines are enriched before the strand breaks (Sawyer et al., 2012). Additionally, these fragments incorporate cytosine-to-uracil mutations on their ends, further hindering the analysis of the sequenced fragments (Sawyer et al., 2012). Even though these characteristics hamper the sequencing and analysis of ancient DNA, they are also useful to differentiate between real ancient DNA and extant DNA contamination (Sawyer et al., 2012). Furthermore, due to the scarce ancient material located throughout few archeological sites worldwide, sample sizes in paleogenomic studies are very small, usually one or few individuals per location and sometimes only one locality (e.g.,Wales et al., 2016; Ramos-Madrigal et al., 2016). Given the above difficulties and the uniqueness of the biological material retrieved from archeological sites, it is crucial to extract and sequence as much ancient DNA as possible while avoiding DNA contamination (Gamba et al., 2016). Major efforts have been made to develop efficient protocols for ancient DNA extraction (Gamba et al., 2016) and single- strand library preparation for high-throughput sequencing (e.g., Gansauge et al., 2017). Organelle genomes were usually the target for ancient DNA sequencing because multiple copies of these can be found within each plant and animal cell and can reveal several demographic processes (Wales et al., 2016; Irving-Pease et al., 2019). Nonetheless, more evolutionary information can be retrieved from nuclear DNA, which is the main target for modern paleogenomic studies (Wales et al., 2016; Irving-Pease et al., 2019). Insights of Paleogenomic Data in Domestication Paleogenomic studies are challenging some of our previous ideas of the domestication process, such as the occurrence of ancient domestication bottlenecks, which appear to be absent in several archeological plant genomes, suggesting that the reduced diversity in domesticated taxa may be a more gradual process from what was expected using DNA of extant populations (Allaby et al., 2019). For example, several archeological samples of Sorghum bicolor from different time periods (ranging from 1800 to 100 years ago) were compared to extant individuals of the species, revealing that this crop did not suffered an initial domestication bottleneck, but rather that the reduction in genetic diversity, and its associated mutational load, occurred gradually throughout time (Smith et al., 2019). Paleogenomics is also revealing important aspects of plant and animal domestication, such as the first genetic steps towards domestication syndromes as well as the overall graduality of the process (Ramos-Madrigal et al., 2016; Vallebueno-Estrada et al., 2016; Daly et al., 2018). For example, archeological remains of goat populations have revealed multiple domestication processes in ancient wild goats, possible dispersal routes of ancient goat populations and signs of early selective pressures towards candidate genes involved in pigmentation, milk production, size, reproduction and changes in diet (Daly et al., 2018). Likewise, several archeological maize samples retrieved from the Tehuacán Valley in Mexico have revealed that early domesticates already presented signals of selective sweeps on important candidate genes, such as teosinte branched1 and brittle endosperm2, but lacked selective sweep signals in other important candidate genes present on modern maize populations, even though these ancient maize populations were already endogamous and more closely related to modern maize than to wild teosinte, revealing that maize domestication was a gradual process ranging thousands of years (Ramos-Madrigal et al., 2016; Vallebueno-Estrada et al., 2016). Other examples demonstrate the importance of paleogenomic studies in domesticated taxa, including grapevine (Wales et al., 2016), barley (Mascher et al., 2016), sunflower (Wales et al., 2019), horses (Schubert et al., 2014), dogs (Frantz et al., 2016) and cats (Ottoni et al., 2017). RNA SEQUENCING TO DETECT DIFFERENTIALLY EXPRESSED GENES ASSOCIATED TO DOMESTICATION Besides the use of RNA-seq to obtain population-level data, comparative transcriptomics is a good way to find or support the validity of candidate genes (Hekman et al., 2015). Transcriptomic analyses between domesticated and wild taxa can reveal important changes in gene expression associated to domestication (Koenig et al., 2013; Hekman et al., 2015; Hradilová et al., 2017). Likewise, the analysis of hybrids between domesticated and wild individuals can reveal important patterns of allele-specific regulation and the role of cis/trans regulatory elements in the emergence of domestication traits (Bell et al., 2013; Lemmon et al., 2014). The Experimental Design of Differential Expression Analyses Transcriptomic profiles are tissue-specific and time-dependent (Hekman et al., 2015). Thus, a good experimental design can reveal important loci involved in the phenotypic differences associated to domestication syndromes, such as suppression of secondary metabolites, changes in form, size, taste, absence of defense mechanisms, seed dormancy, docile behavior, among other traits (Hekman et al., 2015). This can be done by comparing the total RNA expression of the tissue or organ of interest (Koenig et al., 2013), as well as comparing RNA expression throughout the developmental stages of such tissue or organ (Hradilová et al., 2017). Since transcriptomic analyses are experimental by nature, experimental designs require biological replicates for each treatment, condition or organ to assess the variability in the data; as well as controlled environmental conditions to reduce possible biases and sources of error (Fang and Cui, 2011; Schurch et al., 2016). Empirical studies recommend using at least six biological replicates for each condition in the experiment, even though the use of three replicates is common, but discouraged (Burden et al., 2014; Schurch et al., 2016). Additionally, it is important to avoid committing Frontiers in Genetics | www.frontiersin.org 14 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication errors in the experimental design that can bias the results of the RNA-seq experiment, such as using different sequencing technologies for each sample, using different methods for library preparations throughout the samples, sequencing each treatment in a different sequencing flowcell or different lanes within a flowcell (Fang and Cui, 2011). Other technical biases associated to adapter ligation and within-lane variation can be properly assessed when using biological replicates (Fang and Cui, 2011; see Table 1). RNA-seq data can also be complemented with metabolomic data to infer the association between the differential expression of genes and the presence/absence of metabolites between wild and domesticated taxa (Hradilová et al., 2017). After obtaining high-quality data with an appropriate experimental design, RNA-seq analyses usually follow a similar workflow, which should culminate in the detection of differentially expressed genes between a wild plant and its domesticated counterpart (Yang and Kim, 2015; see Table 1). These differentially expressed genes are most likely candidates that may explain to some degree the changes associated to domestication (Koenig et al., 2013; Hradilová et al., 2017). Nonetheless, one must be careful while interpreting the results of these studies, as some differentially expressed genes between wild and domesticated taxa may be a consequence, rather than a cause, of the domestication traits under study (Albert et al., 2012). Successful Examples of RNA-seq Experiments to Understand Domestication RNA-seq analysis has been successfully employed to discover differentially expressed genes involved in the domestication of several plant species. For example, RNA-seq analyses between maize and teosinte found 600 differentially expressed genes and 1,100 genes with altered patterns of co-expression, mainly involved in biotic stress responses, and many of which were previously found as candidate genes using selective scans (Myers et al., 2012). Similar results have been found in tomato (Koenig et al., 2013), pea (Hradilová et al., 2017), common bean (Singh et al., 2018), and carrot (Machaj et al., 2018). This approach has also led to the discovery of differentially expressed genes between dogs and wolves associated to tameness (Li et al., 2013), as well as changes related to the immune system and aerobic capacity (Yang et al., 2018). Another study found differential isoform expression between wild and domesticated sorghum accessions, revealing that domestication can alter the patterns of alternative spicing (Ranwez et al., 2017). Hybrid studies have been performed between maize and teosinte, suggesting potential selection on cis regulatory elements associated with changes in ear tissue and previously reported candidate genes (Lemmon et al., 2014). Another hybrid study in Capsicum annuum using network analyses revealed that loss of function in cis regulatory sequences lead to transcriptional changes in trans elements that are associated with fruit morphology (Díaz-Valenzuela et al., 2020). MODERN EPIGENOMICS AND METHODOLOGICAL STRATEGIES TO EXPLORE DOMESTICATION Epigenetics is classically defined as the heritable mechanisms that regulate gene expression without direct modifications to the DNA sequence, namely DNA methylation, RNA methylation, covalent histone modifications and chromatin assembly states (Sakurada, 2010; Zhao et al., 2017). Epigenetic variants, sometimes called epialleles, are local differences in these epigenetic marks between individuals in a population, which can have similar dynamics to genetic variants (Weigel and Colot, 2012; Guo et al., 2015). Since epigenetic mechanisms underly the ability of organisms to respond to changing environmental conditions, some epigenetic marks associated to these responses are more susceptible to change due to environmental input, while other marks involved in cell differentiation, embryonic development and core cellular functions might be more stable (Turner, 2009). Most of the domestication studies that explain phenotypic differences between wild and domesticated taxa focus on genetic variation. However, the study of epigenomics may explain some of the missing heritability in domestication traits (i.e., the gap between the heritability of a trait estimated by classic genetics and GWAS), the patterns of differentially expressed genes that do not have clear signs of selective sweeps, or even connect the causality between the genetic variation that was selected for during domestication and the resulting phenotypes (Schmitz et al., 2013; Trerotola et al., 2015; Janowitz Koch et al., 2016; Bélteky et al., 2018). Epigenetic variation can be inherited from one generation to the next in a process known as trans-generational epigenetic inheritance, which has been documented in plants and animals (Heard and Martienssen, 2014), even though the overall importance of this trans-generational epigenetic inheritance in plant and animal evolution is still debated (see Table 1). Nevertheless, we consider that studying epigenetic patterns associated to transcriptional activity and phenotypic traits should help understand the emergence of domestication phenotypes (Bélteky et al., 2018). If epigenetic variants such as single methylation polymorphisms (SMPs) show complete transgenerational inheritance, they can even be analyzed using the theoretical tools of population genetics to detect selective sweeps (Schmitz et al., 2013; Janowitz Koch et al., 2016). In a similar fashion to GWAS, the use of epigenome-wide association studies (EWAS) can also reveal the association of an epigenetic variant to a trait of interest in domesticated taxa (Feeney et al., 2014). The same precautions taken in transcriptomic data should also be taken for epigenomic data, since the patterns of epigenetic marks in organisms are tissue- specific, time-dependent and sensitive to environmental input, meaning that epigenomic data should be analyzed for specific organs or tissues of interest in a controlled environment (Jensen, 2015). This is particularly important for the epigenetic marks that respond to environmental input, since domesticated taxa and their wild relatives live under different environmental Frontiers in Genetics | www.frontiersin.org 15 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication conditions. Growing both taxa under controlled conditions will alter the natural state of these marks, but will also help differentiate the heritable epialleles associated to domestication traits (Turner, 2009). Obtaining Population Data From Epigenetic Marks The most studied epigenetic mark is DNA 5-methylcytosine, which refers to the DNA methylation in cytosines which are usually associated to transcriptional gene silencing (He et al., 2011). Cytosine methylome data can be obtained using high-throughput sequencing technologies alongside bisulfite sequencing (Meissner, 2005). Bisulfite sequencing consists in the deamination of unmethylated cytosines through a bisulfite reaction, converting them into uracil, which are encoded as thymine by sequencing technologies (Frommer et al., 1992). The comparison of sequenced DNA that was treated with bisulfite alongside sequenced DNA without treatment can discriminate between methylated and unmethylated cytosines in an organ, tissue or cell-type of interest (Frommer et al., 1992). Reduced representation bisulfite sequencing (RRBS) is a high-throughput technique with a similar rationale to RAD- seq that enriches the sequencing of CG rich regions of the genome after the digestion of restriction enzymes (Meissner, 2005). This makes the RRBS technique a cost-effective option to analyze cytosine methylation patterns in mammals, since its cytosine DNA methylation happens at CG sites (Meissner, 2005; He et al., 2011). Plant cytosine methylomes should instead be analyzed through MethylC-seq, which consists of whole- genome sequencing and bisulfite treatment (Urich et al., 2015), as cytosine methylation can also happen in CHG and CHH sites in plant genomes (He et al., 2011). Cytosine methylation can also be detected using methylated DNA immunoprecipitation sequencing (MeDIP-seq), which consists in shearing the genomic DNA into small pieces followed by the immunoprecipitation of the methylated cytosines using antibodies that recognizes 5- methylcytosine and finally sequencing the DNA sequences with the methylated sites using standard high-throughput sequencing technologies (Weber et al., 2005). Besides cytosine methylation, adenine has also been shown to be methylated in both plants and animals (N6-methyldeoxyadenosine), which cannot be detected using bisulfite sequencing (Luo et al., 2015). However, genomic regions with methylated adenines can be detected using N6-methyldeoxyadenosine immunoprecipitation sequencing (6mA-IP-seq), which uses the same rationale as MeDIP-seq but requires antibodies that specifically targets N6- methyldeoxyadenosine (Fu et al., 2015). PacBio and Nanopore sequencing technologies are known to be sensitive to DNA methylation, regardless of it being on a cytosine or adenine, so they are currently being used as powerful, albeit expensive tools to evaluate DNA methylation patterns in genomes (Gouil and Keniry, 2019). Histone modifications refers to either posttranslational covalent modifications in histones (methylations, acetylations, phosphorylations, ubiquitylations, ADP-ribosylations, sumoylations, crotonylations, malonylations, succinylations) or the substitution of canonical histones by histone variants with different amino acid composition (Bowman and Poirier, 2015). These histone modifications determine the functionality of local genomic regions by changing the state of the chromatin either through its direct effects on the chemical interactions between DNA and histones or through the recruitment of chromatin remodeling complexes (Bowman and Poirier, 2015). Chromatin immunoprecipitation sequencing (ChIP-seq) can be used to assess the genome-wide association between DNA regions and specific histone modifications (Schmidt et al., 2009). ChIP-seq consists in the initial fixation of DNA-protein interactions using formaldehyde followed by DNA fragmentation and subsequent enrichment of the target histone modification using magnetic beads coupled to antibodies in order to sequence the genomic regions where the histone modification is present (Schmidt et al., 2009). ChIP-seq can also be used to assess the interaction between any DNA-binding protein such as transcriptional factors and specific genomic regions (Schmidt et al., 2009). Epigenomic Studies Applied to Understand Domestication The current epigenomic analyses regarding domestication have focused on DNA methylation patterns (Jensen, 2015; Ding and Chen, 2018), but some studies have also ventured into histone modification patterns (He et al., 2014). Recent efforts are trying to connect the discoveries of genomics and epigenetics to understand the evolution of tameness in domesticated animals (Jensen, 2015). A study using RRBS that compared the DNA methylation patterns between wolves and dogs revealed signals of natural selection acting on SMPs which are enriched in transposons and genes involved in the regulation of neurotransmitters, suggesting a dog-specific silencing of genes involved in behavior (Janowitz Koch et al., 2016). Similarly, a recent study using MeDIP-seq in red junglefowl populations that were bred to have either high or low fear to humans discovered genomic region that were differentially methylated in genes that were previously related to tameness (Bélteky et al., 2018). Other studies focused on plant domestication have found differentially methylated sites associated to domestication syndromes (Song et al., 2017; Shen et al., 2018). A study using MethylC-seq found 519 differentially methylated genes between domesticated and wild cotton from which some of them are associated with the observed differences in flowering time and seed dormancy between the wild and domesticated taxa (Song et al., 2017). Another study using MethylC-seq found 4,248 differentially methylated regions between wild and domesticated soybean and 1,164 differentially methylated regions between domesticated and improved soybean (Shen et al., 2018). As expected, the differentially methylated regions in soybean had higher genetic diversity compared to the regions with evidence of selective sweeps that were previously found, and interestingly, 22.5% of the differentially methylated sites could be associated to a causal genetic variant (suggesting that these genetic variants were responsible for the observed epigenetic Frontiers in Genetics | www.frontiersin.org 16 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication patterns), whereas the rest of the differentially methylated regions could be interpreted as genuine epialleles located within genes involved in carbohydrate metabolism (Shen et al., 2018). EXPERIMENTAL VALIDATION OF CANDIDATE GENES Once we have evidence of candidate genes involved in the domestication syndrome, the necessary next step to understand the genetic basis of domestication is to design in vitro systems, knock-out, knock-down or knock-in experiments that validate the involvement of such genes in the observed phenotypes (Zhang et al., 2017). This can be performed either by direct alteration of the genome in the organism of interest, by using RNA interference or by designing heterologous systems in a model organism (Boettcher and McManus, 2015). As an example, a knock-out experiment with backcrosses between domesticated and wild mice elucidated the role of some genes involved in behavioral changes associated to mouse domestication (Chalfin et al., 2014). Previous knock-out and knock-in experiments were restricted to model organisms, but nowadays experimental validation of candidate genes can be supported via knock-out and knock- in experiments, using novel genome editing tools (e.g., Shalem et al., 2014; Hahn et al., 2017; Ueta et al., 2017). Genome-editing tools are already available for a broad range of taxa, including dozens of crop species, but developing a working system in non-model organisms can still be a difficult task that can take several months or even years to accomplish (Shan et al., 2020), so doing collaborative studies alongside experimental researchers is recommended. In this moment, the leading toolset to perform genome editing is the Clustered Regulatory Interspaced Short Palindromic Repeats (CRISPR) system alongside the CRISPR associated protein 9 (Cas9), commonly known as CRISPR/Cas9, which can be used to eliminate, introduce or replace specific segments of DNA within a targeted site in a genome (Cong et al., 2013). Another useful tool for genome editing is the Transcription Activator-Like Effector Nuclease (TALEN) technology, which has its own advantages in comparison to CRISPR/CAS9 (Zhang et al., 2017). RNA interference can also help in validating the function of candidate genes, although it is limited to knock-down experiments (Boettcher and McManus, 2015). Heterologous expression in model organisms is a cost- effective alternative to validate candidate genes (e.g., Schweiger et al., 2010), although this method overlooks the interaction networks that exist in vivo which are accountable for the emergence of phenotypes (Rodríguez-Mega et al., 2015). Regardless the genome-editing tool of choice (Boettcher and McManus, 2015; Zhang et al., 2017), genome edition is proving its usefulness to validate the effect of candidate genes involved in domestication through the introduction of domesticated alleles on wild relatives and vice-versa (Zhou J. et al., 2019), which can prove that the gene is indeed involved in the appearance of the domesticated phenotype (Zhou J. et al., 2019). This can be performed in the same way as a usual knock-out or knock-in experiment, where the edited locus must be validated through PCR and Sanger sequencing, a PCR-RFLP analysis or using Western-blot in case of a protein knock-out (e.g., Ueta et al., 2017). The expected result of these type of studies is to find a modified phenotype after editing a candidate locus, either a wild individual with a domesticated-like phenotypic trait or a domesticated individual with a wild-like phenotypic trait (Zhou J. et al., 2019). Of course, the above studies will hardly reproduce a complete domesticated or wild phenotype, since genetic elements interact in complex regulatory networks, including other elements within the genome as well as epigenetic and environmental components (Rodríguez-Mega et al., 2015), but nonetheless will be useful to understand the role of those genes in the emergence of domesticated phenotypes. Once the candidate genes are validated, genome-editing tools can also become useful to introduce desirable traits from wild relatives to its domesticated counterparts, a goal of great interest for crop improvement (Zhou J. et al., 2019) and currently used to accelerate plant breeding and to fine-tune desirable traits (Wolter et al., 2019). Furthermore, recent efforts are trying to domesticate plant crops de novo by inserting the desired domestication alleles into their wild relatives, generating crops with the desired domestication phenotypes but without the problems of low genetic variation and accumulation of deleterious mutations that are an inevitable consequence of regular domestication processes (Fernie and Yan, 2019). CONCLUSION AND PERSPECTIVES Plant and animal domestication can be studied using genomic, transcriptomic and epigenomic strategies, revealing the action of evolutionary, ecological and anthropogenic processes (Kantar et al., 2017). These tools can lead us beyond the description of the possible historical scenarios that shaped the domesticated species, since we can explore the effects of domestication on the transcriptomic activity of a species (Hekman et al., 2015), test the validity of candidate genes associated to domestication phenotypes (Zhou J. et al., 2019) and analyze epigenetic patterns associated to domestication traits (Jensen, 2015). Many domesticated taxa remain genetically unexplored, and as sequencing technologies become cheaper and more efficient, domestication genomics will soon be available for polyploids and species with huge genomes (e.g., Edger et al., 2019). Nonetheless, the modern study of domestication of plants and animals should still be multidisciplinary, since genetics only tells us part of the story (Larson et al., 2014). An extended synthesis framework should also be considered to understand domestication, as these new studies are helping us understand niche construction and the emergence of domesticated phenotypes (Piperno, 2017). Other potential lines of work remain to be addressed in domestication studies, such as the changes in the chromatin architecture (e.g., Concia et al., 2020), the use of comparative proteomic atlases (e.g., Jiang Y. et al., 2019) and the analysis of cell-type divergences during development using single-cell RNA-seq data (Arendt et al., 2016). The use of this multi-omic approaches will help us create and Frontiers in Genetics | www.frontiersin.org 17 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication compare developmental atlases (e.g., Walley et al., 2016) between wild and domesticated taxa to understand how morphology diverged during domestication. AUTHOR CONTRIBUTIONS JB-R, DP, and LE wrote the manuscript. All authors contributed to the article and approved the submitted version. FUNDING This study was funded by Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO) KE004 “Diversidad genética de las especies de Cucurbita en México e hibridación entre plantas genéticamente modificadas y especies silvestres de Cucurbita” and CONABIO PE001 “Diversidad genética de las especies de Cucurbita en México. Fase II. Genómica evolutiva y de poblaciones, recursos genéticos y domesticación, both awarded to Rafael Lira-Saade and LE. DP was funded by Instituto de Ecología, UNAM”. JB-R is a doctoral student from Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México and received fellowship 583146 from Consejo Nacional de Ciencia y Tecnología (CONACyT). ACKNOWLEDGMENTS We acknowledge the Doctorado en Ciencias Biomédicas for the academic support provided to JB-R during the development of this project.We thank AlejandraMoreno Letelier for her insights, which greatly improved the quality of the manuscript. SUPPLEMENTARY MATERIAL The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2020.00742/full#supplementary-material REFERENCES Alachiotis, N., and Pavlidis, P. (2018). RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors. Commun. Biol. 1, 1–11. doi: 10.1038/s42003-018-0085-8 Alachiotis, N., Stamatakis, A., and Pavlidis, P. (2012). OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics 28, 2274–2275. doi: 10.1093/bioinformatics/bts419 Albert, F. W., Somel, M., Carneiro, M., Aximu-Petri, A., Halbwax, M., Thalmann, O., et al. (2012). A comparison of brain gene expression levels in domesticated and wild animals. PLoS Genet. 8:e1002962. doi: 10.1371/journal.pgen.1002962 Alberto, F. J., Boyer, F., Orozco-terWengel, P., Streeter, I., Servin, B., de Villemereuil, P., et al. (2018). Convergent genomic signatures of domestication in sheep and goats. Nat. Commun. 9:813. doi: 10.1038/s41467-018-03206-y Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. doi: 10.1101/ gr.094052.109 Ali, O. A., O’Rourke, S. M., Amish, S. J., Meek, M. H., Luikart, G., Jeffres, C., et al. (2016). RAD capture (Rapture): flexible and efficient sequence-based genotyping. Genetics 202, 389–400. doi: 10.1534/genetics.115.183665 Allaby, R. G., Ware, R. L., and Kistler, L. (2019). A re−evaluation of the domestication bottleneck from archaeogenomic evidence. Evol. Appl. 12, 29–37. doi: 10.1111/eva.12680 Anderson, C. N. K., Ramakrishnan, U., Chan, Y. L., and Hadly, E. A. (2005). Serial SimCoal: a population genetics model for data frommultiple populations and points in time. Bioinformatics 21, 1733–1734. doi: 10.1093/bioinformatics/ bti154 Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G., andHohenlohe, P. A. (2016). Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet. 17, 81–92. doi: 10.1038/nrg.2015.28 Arendt, D., Musser, J. M., Baker, C. V., Bergman, A., Cepko, C., Erwin, D. H., et al. (2016). The origin and evolution of cell types. Nat. Rev. Genet. 17, 744–757. doi: 10.1038/nrg.2016.127 Badouin, H., Gouzy, J., Grassa, C. J., Murat, F., Staton, S. E., Cottret, L., et al. (2017). The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546, 148–152. doi: 10.1038/nature22380 Baran, Y., Pasaniuc, B., Sankararaman, S., Torgerson, D. G., Gignoux, C., Eng, C., et al. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28, 1359–1367. doi: 10.1093/bioinformatics/bts144 Baute, G. J., Kane, N. C., Grassa, C. J., Lai, Z., and Rieseberg, L. H. (2015). Genome scans reveal candidate domestication and improvement genes in cultivated sunflower, as well as post-domestication introgression with wild relatives. New Phytol. 206, 830–838. doi: 10.1111/nph.13255 Bell, G. D., Kane, N. C., Rieseberg, L. H., and Adams, K. L. (2013). RNA-seq analysis of allele-specific expression, hybrid effects, and regulatory divergence in hybrids compared with their parents from natural populations. Genome Biol. Evol. 5, 1309–1323. doi: 10.1093/gbe/evt072 Belser, C., Istace, B., Denis, E., Dubarry, M., Baurens, F. C., Falentin, C., et al. (2018). Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants 4, 879–887. doi: 10.1038/s41477-018- 0289-4 Bélteky, J., Agnvall, B., Bektic, L., Höglund, A., Jensen, P., and Guerrero-Bosagna, C. (2018). Epigenetics and early domestication: differences in hypothalamic DNA methylation between red junglefowl divergently selected for high or low fear of humans. Genet. Sel. Evol. 50:13. doi: 10.1186/s12711-018-0384-z Bertolini, F., Servin, B., Talenti, A., Rochat, E., Kim, E. S., Oget, C., et al. (2018). Signatures of selection and environmental adaptation across the goat genome post-domestication. Genet. Sel. Evol. 50:57. doi: 10.1186/s12711-018-0421-y Bickhart, D. M., Rosen, B. D., Koren, S., Sayre, B. L., Hastie, A. R., Chan, S., et al. (2017). Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650. doi: 10.1038/ng.3802 Boettcher, M., and McManus, M. T. (2015). Choosing the right tool for the job: RNAi, TALEN, or CRISPR. Mol. Cell 58, 575–585. doi: 10.1016/j.molcel.2015. 04.028 Bomba, L., Nicolazzi, E. L., Milanesi, M., Negrini, R., Mancini, G., Biscarini, F., et al. (2015). Relative extended haplotype homozygosity signals across breeds reveal dairy and beef specific signatures of selection. Genet. Sel. Evol. 47:25. doi: 10.1186/s12711-015-0113-9 Bonhomme, M., Chevalet, C., Servin, B., Boitard, S., Abdallah, J., Blott, S., et al. (2010). Detecting selection in population trees: the lewontin and krakauer test extended. Genetics 186, 241–262. doi: 10.1534/genetics.104.117275 Bowman, G. D., and Poirier, M. G. (2015). Post-translational modifications of histones that influence nucleosome dynamics. Chem. Rev. 115, 2274–2295. doi: 10.1021/cr500350x Brohammer, A. B., Kono, T. J. Y., and Hirsch, C. N. (2018). “The maize pan- genome,” in The Maize Genome. Compendium of Plant Genomes, eds J. Bennetzen, S. Flint-Garcia, C. Hirsch, and R. Tuberosa (Cham: Springer), 13–29. doi: 10.1007/978-3-319-97427-9_2 Brozynska, M., Furtado, A., and Henry, R. J. (2016). Genomics of crop wild relatives: expanding the gene pool for crop improvement. Plant Biotechnol. J. 14, 1070–1085. doi: 10.1111/pbi.12454 Frontiers in Genetics | www.frontiersin.org 18 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin- Dogru, D., et al. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell Rep. 18, 762–776. doi: 10.1016/ j.celrep.2016.12.063 Burden, C. J., Qureshi, S. E., and Wilson, S. R. (2014). Error estimates for the analysis of differential expression from RNA-seq count data. PeerJ 2:e576. doi: 10.7717/peerj.576 Burgarella, C., Cubry, P., Kane, N. A., Varshney, R. K., Mariac, C., Liu, X., et al. (2018). A western Sahara centre of domestication inferred from pearl millet genomes. Nat. Ecol. Evol. 2, 1377–1380. doi: 10.1038/s41559-018-0643-y Burggren, W. (2016). Epigenetic inheritance and its role in evolutionary biology: re-evaluation and new perspectives. Biology (Basel). 5:24. doi: 10.3390/ biology5020024 Cao, K., Li, Y., Deng, C. H., Gardiner, S. E., Zhu, G., Fang, W., et al. (2019). Comparative population genomics identified genomic regions and candidate genes associated with fruit domestication traits in peach. Plant Biotechnol. J. 17, 1954–1970. doi: 10.1111/pbi.13112 Catchen, J., Hohenlohe, P., Bassham, S., Amores, A., and Cresko, W. (2013). Stacks: an analysis tool set for population genomics. Mol. Ecol. 22, 3124–3140. doi: 10.1016/j.biotechadv.2011.08.021.Secreted Chalfin, L., Dayan, M., Levy, D. R., Austad, S. N., Miller, R. A., Iraqi, F. A., et al. (2014). Mapping ecologically relevant social behaviours by gene knockout in wild mice. Nat. Commun. 5:4569. doi: 10.1038/ncomms5569 Charlesworth, B. (2009). Effective population size and patterns of molecular evolution and variation. Nat. Rev. Genet. 10, 195–205. doi: 10.1038/nrg2526 Chen, H., Patterson, N., and Reich, D. (2010). Population differentiation as a test for selective sweeps. Genome Res. 20, 393–402. doi: 10.1101/gr.100545.109 Chen, J., Ni, P., Li, X., Han, J., Jakovliæ, I., Zhang, C., et al. (2018). Population size may shape the accumulation of functional mutations following domestication. BMC Evol. Biol. 18:4. doi: 10.1186/s12862-018-1120-6 Chen, N., Cai, Y., Chen, Q., Li, R., Wang, K., Huang, Y., et al. (2018). Whole- genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated cattle in East Asia. Nat. Commun. 9:2337. doi: 10.1038/ s41467-018-04737-0 Clark, J. W., and Donoghue, P. C. J. (2018). Whole-genome duplication and plant macroevolution. Trends Plant Sci. 23, 933–945. doi: 10.1016/j.tplants.2018. 07.006 Concia, L., Veluchamy, A., Ramirez-Prado, J. S., Martin-Ramirez, A., Huang, Y., Perez, M., et al. (2020). Wheat chromatin architecture is organized in genome territories and transcription factories. Genome Biol. 21, 1–20. doi: 10.1186/ s13059-020-01998-1 Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., et al. (2013). Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823. doi: 10.1126/science.1231143 Cornuet, J.-M., Pudlo, P., Veyssier, J., Dehne-Garcia, A., Gautier, M., Leblois, R., et al. (2014). DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data. Bioinformatics 30, 1187–1189. doi: 10.1093/bioinformatics/btt763 Daly, K. G., Maisano Delser, P., Mullin, V. E., Scheu, A., Mattiangeli, V., Teasdale, M. D., et al. (2018). Ancient goat genomes reveal mosaic domestication in the Fertile Crescent. Science 361, 85–88. doi: 10.1126/science.aas9411 Davey, J. W., and Blaxter, M. L. (2010). RADSeq?: next-generation population genetics. Brief. Funct. Genomics 9, 416–423. doi: 10.1093/bfgp/elq031 DeMita, S., Thuillet, A.-C., Gay, L., Ahmadi, N., Manel, S., Ronfort, J., et al. (2013). Detecting selection along environmental gradients: analysis of eight methods and their effectiveness for outbreeding and selfing populations. Mol. Ecol. 22, 1383–1399. doi: 10.1111/mec.12182 de Villemereuil, P., and Gaggiotti, O. E. (2015). A new F ST -based method to uncover local adaptation using environmental variables. Methods Ecol. Evol. 6, 1248–1258. doi: 10.1111/2041-210X.12418 De Wit, P., Pespeni, M. H., Ladner, J. T., Barshis, D. J., Seneca, F., Jaris, H., et al. (2012). The simple fool’s guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Mol. Ecol. Resour. 12, 1058–1067. doi: 10.1111/1755-0998.12003 Delaneau, O., Howie, B., Cox, A. J., Zagury, J.-F., and Marchini, J. (2013). Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696. doi: 10.1016/j.ajhg.2013.09.002 Diao, L., and Chen, K. C. (2012). Local ancestry corrects for population structure in Saccharomyces cerevisiae genome-wide association studies. Genetics 192, 1503–1511. doi: 10.1534/genetics.112.144790 Dias-Alves, T., Mairal, J., and Blum, M. G. B. (2018). Loter: a software package to infer local ancestry for a wide range of species. Mol. Biol. Evol. 35, 2318–2326. doi: 10.1093/molbev/msy126 Díaz-Valenzuela, E., Sawers, R. H., and Cibrián-Jaramillo, A. (2020). Cis-and trans- regulatory variations in the domestication of the chili pepper fruit. Mol. Biol. Evol. 37, 1593–1603. doi: 10.1093/molbev/msaa027 Ding, M., and Chen, Z. J. (2018). Epigenetic perspectives on the evolution and domestication of polyploid plant and crops. Curr. Opin. Plant Biol. 42, 37–48. doi: 10.1016/j.pbi.2018.02.003 Doebley, J., and Stec, A. (1991). Genetic analysis of the morphological differences between maize and teosinte. Genetics 129, 285–295. Doebley, J., Stec, A., and Gustus, C. (1995). teosinte branched1 and the origin of maize: evidence for epistasis and the evolution of dominance. Genetics 141, 333–346. Dong, Y., Xie, M., Jiang, Y., Xiao, N., Du, X., Zhang, W., et al. (2013). Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat. Biotechnol. 31, 135–141. doi: 10.1038/nb t.2478 Dorant, Y., Benestan, L., Rougemont, Q., Normandeau, E., Boyle, B., Rochette, R., et al. (2019). Comparing Pool−seq, Rapture, and GBS genotyping for inferring weak population structure: the American lobster (Homarus americanus) as a case study. Ecol. Evol. 9, 6606–6623. doi: 10.1002/ece3.5240 Durand, E. Y., Patterson, N., Reich, D., and Slatkin, M. (2011). Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28, 2239–2252. doi: 10.1093/molbev/msr048 Edelman, N. B., Frandsen, P. B., Miyagi, M., Clavijo, B., Davey, J., Dikow, R. B., et al. (2019). Genomic architecture and introgression shape a butterfly radiation. Science 366, 594–599. doi: 10.1126/science.aaw2090 Edger, P. P., Poorten, T. J., VanBuren, R., Hardigan, M. A., Colle, M., McKain, M. R., et al. (2019). Origin and evolution of the octoploid strawberry genome. Nat. Genet. 51, 541–547. doi: 10.1038/s41588-019-0356-4 Ellegren, H. (2014). Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29, 51–63. doi: 10.1016/j.tree.2013.09.008 Excoffier, L., and Foll, M. (2011). fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–1334. doi: 10.1093/bioinformatics/b tr124 Excoffier, L., Hofer, T., and Foll, M. (2009). Detecting loci under selection in a hierarchically structured population. Heredity (Edinb). 103:285. doi: 10.1038/ hdy.2009.74 Fang, Z., and Cui, X. (2011). Design and validation issues in RNA-seq experiments. Brief. Bioinform. 12, 280–287. doi: 10.1093/bib/bbr004 Fay, J. C., and Wu, C. I. (2000). Hitchhiking under positive Darwinian selection. Genetics 155, 1405–1413. Feeney, A., Nilsson, E., and Skinner, M. K. (2014). Epigenetics and transgenerational inheritance in domesticated farm animals. J. Anim. Sci. Biotechnol. 5:48. doi: 10.1186/2049-1891-5-48 Fernie, A. R., and Yan, J. (2019). De novo domestication: an alternative route toward new crops for the future. Mol. Plant 12, 615–631. doi: 10.1016/j.molp. 2019.03.016 Fierst, J. L. (2015). Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Front. Genet. 6:220. doi: 10.3389/fgene.2015.00220 Fitz-Gibbon, S., Hipp, A. L., Pham, K. K., Manos, P. S., and Sork, V. L. (2017). Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus). Genome 60, 743–755. doi: 10.1139/gen-2016-0202 Foll, M., and Gaggiotti, O. (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180, 977–993. doi: 10.1534/genetics.108.092221 Frantz, L. A. F., Mullin, V. E., Pionnier-Capitan, M., Lebrasseur, O., Ollivier, M., Perri, A., et al. (2016). Genomic and archaeological evidence suggest a dual origin of domestic dogs. Science 352, 1228–1231. doi: 10.1126/science.aaf3161 Frantz, L. A. F., Schraiber, J. G., Madsen, O., Megens, H.-J., Cagan, A., Bosse, M., et al. (2015). Evidence of long-term gene flow and selection during Frontiers in Genetics | www.frontiersin.org 19 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication domestication from analyses of Eurasian wild and domestic pig genomes. Nat. Genet. 47, 1141–1148. doi: 10.1038/ng.3394 Frommer, M., McDonald, L. E., Millar, D. S., Collis, C. M., Watt, F., Grigg, G. W., et al. (1992). A genomic sequencing protocol that yields a positive display of 5- methylcytosine residues in individual DNA strands. Proc. Natl. Acad. Sci. U.S.A. 89, 1827–1831. doi: 10.1073/pnas.89.5.1827 Fu, Y., Luo, G.-Z., Chen, K., Deng, X., Yu, M., Han, D., et al. (2015). N6-methyldeoxyadenosine marks active transcription start sites in chlamydomonas. Cell 161, 879–892. doi: 10.1016/j.cell.2015.04.010 Fugère, V., and Hendry, A. P. (2018). Human influences on the strength of phenotypic selection. Proc. Natl. Acad. Sci. U.S.A. 115, 10070–10075. doi: 10. 1073/pnas.1806013115 Fumagalli, M. (2013). Assessing the effect of sequencing depth and sample size in population genetics inferences. PLoS One 8:e79667. doi: 10.1371/journal.pone. 0079667 Futschik, A., and Schlötterer, C. (2010). The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186, 207–218. doi: 10.1534/genetics.110.114397 Gamba, C., Hanghøj, K., Gaunitz, C., Alfarhan, A. H., Alquraishi, S. A., Al-Rasheid, K. A. S., et al. (2016). Comparing the performance of three ancient DNA extraction methods for high-throughput sequencing. Mol. Ecol. Resour. 16, 459–469. doi: 10.1111/1755-0998.12470 Gansauge, M.-T., Gerber, T., Glocke, I., Korleviæ, P., Lippik, L., Nagel, S., et al. (2017). Single-stranded DNA library preparation from highly degraded DNA using T4 DNA ligase. Nucleic Acids Res. 45:gkx033. doi: 10.1093/nar/gkx033 Gao, L., Gonda, I., Sun, H.,Ma, Q., Bao, K., Tieman, D.M., et al. (2019). The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 51, 1044–1051. doi: 10.1038/s41588-019-0410-2 Garud, N. R.,Messer, P.W., Buzbas, E. O., and Petrov, D. A. (2015). Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11:e1005004. doi: 10.1371/journal.pgen.1005004 Gepts, P. (2014). The contribution of genetic and genomic approaches to plant domestication studies.Curr. Opin. Plant Biol. 18, 51–59. doi: 10.1016/j.pbi.2014. 02.001 Gerbault, P., Allaby, R. G., Boivin, N., Rudzinski, A., Grimaldi, I. M., Pires, J. C., et al. (2014). Storytelling and story testing in domestication. Proc. Natl. Acad. Sci. U.S.A. 111, 6159–6164. doi: 10.1073/pnas.1400425111 Gibson, G. (2018). Population genetics and GWAS: a primer. PLoS Biol. 16:e2005485. doi: 10.1371/journal.pbio.2005485 Golicz, A. A., Batley, J., and Edwards, D. (2016a). Towards plant pangenomics. Plant Biotechnol. J. 14, 1099–1105. doi: 10.1111/pbi.12499 Golicz, A. A., Bayer, P. E., Barker, G. C., Edger, P. P., Kim, H., Martinez, P. A., et al. (2016b). The pangenome of an agronomically important crop plant Brassica oleracea. Nat. Commun. 7, 1–8. doi: 10.1038/ncomms13390 Gouil, Q., and Keniry, A. (2019). Latest techniques to study DNA methylation. Essays Biochem. 63, 639–648. doi: 10.1042/EBC20190027 Groeneveld, L. F., Lenstra, J. A., Eding, H., Toro, M. A., Scherf, B., Pilling, D., et al. (2010). Genetic diversity in farm animals – a review. Anim. Genet. 41, 6–31. doi: 10.1111/j.1365-2052.2010.02038.x Guerra García, A., and Piñero, D. (2017). Current approaches andmethods in plant domestication studies. Bot. Sci. 95:345. doi: 10.17129/botsci.1209 Guerra-García, A., Suárez-Atilano, M., Mastretta-Yanes, A., Delgado-Salinas, A., and Piñero, D. (2017). Domestication genomics of the open-pollinated scarlet runner bean (Phaseolus coccineus L.). Front. Plant Sci. 8:1891. doi: 10.3389/fpls. 2017.01891 Guerrero-Bosagna, C. (2012). Finalism in darwinian and lamarckian evolution: lessons from epigenetics and developmental biology. Evol. Biol. 39, 283–300. doi: 10.1007/s11692-012-9163-x Günther, T., and Nettelblad, C. (2019). The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 15:e1008302. doi: 10.1371/journal.pgen.1008302 Guo, S., Zhang, J., Sun, H., Salse, J., Lucas, W. J., Zhang, H., et al. (2012). The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat. Genet. 45, 51–58. doi: 10.1038/ng.2470 Guo, Z., Song, G., Liu, Z., Qu, X., Chen, R., Jiang, D., et al. (2015). Global epigenomic analysis indicates that Epialleles contribute to Allele-specific expression via Allele-specific histone modifications in hybrid rice. BMC Genomics 16:232. doi: 10.1186/s12864-015-1454-z Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., and Bustamante, C. D. (2009). Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5:e1000695. doi: 10.1371/ journal.pgen.1000695 Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512. doi: 10.1038/nprot.2013.084 Hahn, F., Eisenhut, M., Mantegazza, O., and Weber, A. (2017). Generation of targeted knockout mutants in Arabidopsis thaliana using CRISPR/Cas9. Bio Protoc. 7, 1–20. doi: 10.21769/BioProtoc.2384 Hahn,M.W., andHibbins, M. S. (2019). A three-sample test for introgression.Mol. Biol. Evol. 36, 2878–2882. doi: 10.1093/molbev/msz178 Han, L., and Abney, M. (2013). Using identity by descent estimation with dense genotype data to detect positive selection. Eur. J. Hum. Genet. 21, 205–211. doi: 10.1038/ejhg.2012.148 Hanchard, N. A., Rockett, K. A., Spencer, C., Coop, G., Pinder, M., Jallow, M., et al. (2006). Screening for recently selected alleles by analysis of human haplotype similarity. Am. J. Hum. Genet. 78, 153–159. doi: 10.1086/499252 He, S., Yan, S., Wang, P., Zhu, W., Wang, X., Shen, Y., et al. (2014). Comparative analysis of genome-wide chromosomal histone modification patterns in maize cultivars and their wild relatives. PLoS One 9:e97364. doi: 10.1371/journal.pone. 0097364 He, X.-J., Chen, T., and Zhu, J.-K. (2011). Regulation and function of DNA methylation in plants and animals. Cell Res. 21, 442–465. doi: 10.1038/cr. 2011.23 Heard, E., andMartienssen, R. A. (2014). Transgenerational epigenetic inheritance: myths and mechanisms. Cell 157, 95–109. doi: 10.1016/j.cell.2014.02.045 Hekman, J. P., Johnson, J. L., and Kukekova, A. V. (2015). Transcriptome analysis in domesticated species: challenges and strategies. Bioinform. Biol. Insights 9(Suppl. 4), 21–31. doi: 10.4137/BBI.S29334 Heled, J., and Drummond, A. J. (2008). Bayesian inference of population size history from multiple loci. BMC Evol. Biol. 8:289. doi: 10.1186/1471-2148- 8-289 Hradilová, I., Trnìnı , O., Válková, M., Cechová, M., Janská, A., Prokešová, L., et al. (2017). A combined comparative transcriptomic, metabolomic, and anatomical analyses of two key domestication traits: pod dehiscence and seed dormancy in pea (Pisum sp.). Front. Plant Sci. 8:542. doi: 10.3389/fpls.2017.00542 Huang, X., Kurata, N., Wei, X., Wang, Z.-X., Wang, A., Zhao, Q., et al. (2012). A map of rice genome variation reveals the origin of cultivated rice. Nature 490, 497–501. doi: 10.1038/nature11532 Hübner, S., Bercovich, N., Todesco, M., Mandel, J. R., Odenheimer, J., Ziegler, E., et al. (2019). Sunflower pan-genome analysis shows that hybridization altered gene content and disease resistance. Nat. Plants 5, 54–62. doi: 10.1038/s41477- 018-0329-0 Hufford, M. B., Xu, X., van Heerwaarden, J., Pyhäjärvi, T., Chia, J.-M., Cartwright, R. A., et al. (2012). Comparative population genomics of maize domestication and improvement. Nat. Genet. 44, 808–811. doi: 10.1038/ng.2309 Hurgobin, B., Golicz, A. A., Bayer, P. E., Chan, C. K. K., Tirnaz, S., Dolatabadian, A., et al. (2018). Homoeologous exchange is amajor cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol. J. 16, 1265–1274. doi: 10.1111/pbi.12867 Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G., Pérez-Torres, C. A., Carretero-Paulet, L., Chang, T. H., et al. (2013). Architecture and evolution of a minute plant genome. Nature 498, 94–98. doi: 10.1038/nature12132 Ilska, J., Haskell, M. J., Blott, S. C., Sánchez-Molano, E., Polgar, Z., Lofgren, S. E., et al. (2017). Genetic characterization of dog personality traits. Genetics 206, 1101–1111. doi: 10.1534/genetics.116.192674 Inbar, S., Cohen, P., Yahav, T., and Privman, E. (2020). Comparative study of population genomic approaches for mapping colony-level traits. PLoS Comput. Biol. 16:e1007653. doi: 10.1371/journal.pcbi.1007653 Irving-Pease, E. K., Ryan, H., Jamieson, A., Dimopoulos, E. A., Larson, G., and Frantz, L. A. F. (2019). “Paleogenomics of animal domestication,” in Paleogenomics: Genome-Scale Analysis of Ancient DNA, eds C. Lindqvist and O. P. Rajora (Cham: Springer International Publishing), 225–272. doi: 10.1007/ 13836_2018_55 Janowitz Koch, I., Clark, M. M., Thompson, M. J., Deere-Machemer, K. A., Wang, J., Duarte, L., et al. (2016). The concerted impact of domestication and Frontiers in Genetics | www.frontiersin.org 20 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication transposon insertions on methylation patterns between dogs and grey wolves. Mol. Ecol. 25, 1838–1855. doi: 10.1111/mec.13480 Janzen, G. M., Wang, L., and Hufford, M. B. (2019). The extent of adaptive wild introgression in crops. New Phytol. 221, 1279–1288. doi: 10.1111/nph.15457 Jensen, P. (2015). Adding “epi-” to behaviour genetics: implications for animal domestication. J. Exp. Biol. 218, 32–40. doi: 10.1242/jeb.106799 Jiang, L. G., Li, B., Liu, S. X., Wang, H. W., Li, C. P., Song, S. H., et al. (2019). Characterization of proteome variation during modern maize breeding. Mol. Cell. Proteomics 18, 263–276. doi: 10.1074/mcp.RA118.001021 Jiang, Y., Jiang, Y., Wang, S., Zhang, Q., and Ding, X. (2019). Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinform. 20:556. doi: 10.1186/s12859-019-3164-z Jiao, Y., Zhao, H., Ren, L., Song, W., Zeng, B., Guo, J., et al. (2012). Genome-wide genetic changes during modern breeding of maize. Nat. Genet. 44, 812–815. doi: 10.1038/ng.2312 Jombart, T., Devillard, S., and Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11:94. doi: 10.1186/1471-2156-11-94 Jorde, L. B. (2001). Population genomics: a bridge from evolutionary history to genetic medicine. Hum. Mol. Genet. 10, 2199–2207. doi: 10.1093/hmg/10.20. 2199 Kantar, M. B., Nashoba, A. R., Anderson, J. E., Blackman, B. K., and Rieseberg, L. H. (2017). The genetics and genomics of plant domestication. Bioscience 67, 971–982. doi: 10.1093/biosci/bix114 Kaur, P., and Gaikwad, K. (2017). From genomes to GENE-omes: exome sequencing concept and applications in crop improvement. Front. Plant Sci. 8:2164. doi: 10.3389/fpls.2017.02164 Khan, A. W., Garg, V., Roorkiwal, M., Golicz, A. A., Edwards, D., and Varshney, R. K. (2020). Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends. Plant Sci. 25, 148–158. doi: 10.1016/j. tplants.2019.10.012 Khan, M. A., Olsen, K. M., Sovero, V., Kushad, M. M., and Korban, S. S. (2014). Fruit quality traits have played critical roles in domestication of the apple. Plant Genome 7, 1–18. doi: 10.3835/plantgenome2014.04.0018 Kim, Y., and Nielsen, R. (2004). Linkage disequilibrium as a signature of selective sweeps. Genetics 167, 1513–1524. doi: 10.1534/genetics.103.025387 Koenig, D., Jimenez-Gomez, J. M., Kimura, S., Fulop, D., Chitwood, D. H., Headland, L. R., et al. (2013). Comparative transcriptomics reveals patterns of selection in domesticated and wild tomato. Proc. Natl. Acad. Sci. U.S.A. 110, E2655–E2662. doi: 10.1073/pnas.1309606110 Korneliussen, T. S., Albrechtsen, A., and Nielsen, R. (2014). ANGSD: analysis of next generation sequencing data. BMC Bioinform. 15:356. doi: 10.1186/s12859- 014-0356-4 Korneliussen, T. S., Moltke, I., Albrechtsen, A., and Nielsen, R. (2013). Calculation of Tajima’s D and other neutrality test statistics from low depth next- generation sequencing data. BMC Bioinform. 14:289. doi: 10.1186/1471-2105- 14-289 Larson, G., Piperno, D. R., Allaby, R. G., Purugganan, M. D., Andersson, L., Arroyo-Kalin, M., et al. (2014). Current perspectives and the future of domestication studies. Proc. Natl. Acad. Sci. U.S.A. 111, 6139–6146. doi: 10. 1073/pnas.1323964111 Lelieveld, S. H., Spielmann, M., Mundlos, S., Veltman, J. A., and Gilissen, C. (2015). Comparison of exome and genome sequencing technologies for the complete capture of protein-coding regions. Hum. Mutat. 36, 815–822. doi: 10.1002/humu.22813 Lemmon, Z. H., Bukowski, R., Sun, Q., and Doebley, J. F. (2014). The role of Cis regulatory evolution in maize domestication. PLoS Genet. 10:e1004745. doi: 10.1371/journal.pgen.1004745 Levy, S. E., andMyers, R. M. (2016). Advancements in next-generation sequencing. Annu. Rev. Genomics Hum. Genet. 17, 95–115. doi: 10.1146/annurev-genom- 083115-022413 Li, H., and Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature 475, 493–496. doi: 10.1038/ nature10231 Li, R., Fu, W., Su, R., Tian, X., Du, D., Zhao, Y., et al. (2019). Towards the complete goat pan-genome by recovering missing genomic segments from the reference genome. Front. Genet. 10:1169. doi: 10.3389/fgene.2019. 01169 Li, Y., Von Holdt, B. M., Reynolds, A., Boyko, A. R., Wayne, R. K., Wu, D. D., et al. (2013). Artificial selection on brain-expressed genes during the domestication of dog.Mol. Biol. Evol. 30, 1867–1876. doi: 10.1093/molbev/mst088 Li, Y., Zhou, G., Ma, J., Jiang, W., Jin, L., Zhang, Z., et al. (2014). De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat. Biotechnol. 32, 1045–1052. doi: 10.1038/nbt.2979 Li, Z., Chen, J., Wang, Z., Pan, Y., Wang, Q., Xu, N., et al. (2016). Detection of selection signatures of population-specific genomic regions selected during domestication process in Jinhua pigs. Anim. Genet. 47, 672–681. doi: 10.1111/ age.12475 Librado, P., Fages, A., Gaunitz, C., Leonardi, M., Wagner, S., Khan, N., et al. (2016). The evolutionary origin and genetic makeup of domestic horses. Genetics 204, 423–434. doi: 10.1534/genetics.116.194860 Linck, E., and Battey, C. J. (2019). Minor allele frequency thresholds strongly affect population structure inference with genomic data sets. Mol. Ecol. Resour. 19, 639–647. doi: 10.1111/1755-0998.12995 Liu, X., and Fu, Y.-X. (2015). Exploring population size changes using SNP frequency spectra. Nat. Genet. 47, 555–559. doi: 10.1038/ng.3254 Lotterhos, K. E., and Whitlock, M. C. (2014). Evaluation of demographic history and neutral parameterization on the performance of F ST outlier tests.Mol. Ecol. 23, 2178–2192. doi: 10.1111/mec.12725 Lotterhos, K. E., and Whitlock, M. C. (2015). The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Mol. Ecol. 24, 1031–1046. doi: 10.1111/mec.13100 Lowry, D. B., Hoban, S., Kelley, J. L., Lotterhos, K. E., Reed, L. K., Antolin, M. F., et al. (2017). Breaking RAD: an evaluation of the utility of restriction site- associated DNA sequencing for genome scans of adaptation.Mol. Ecol. Resour. 17, 142–152. doi: 10.1111/1755-0998.12635 Luo, G.-Z., Blanco, M. A., Greer, E. L., He, C., and Shi, Y. (2015). DNA N6- methyladenine: a new epigenetic mark in eukaryotes? Nat. Rev. Mol. Cell Biol. 16, 705–710. doi: 10.1038/nrm4076 Luu, K., Bazin, E., and Blum, M. G. B. (2017). pcadapt?: an R package to perform genome scans for selection based on principal component analysis. Mol. Ecol. Resour. 17, 67–77. doi: 10.1111/1755-0998.12592 Lye, Z. N., and Purugganan, M. D. (2019). Copy number variation in domestication.Trends Plant Sci. 24, 352–365. doi: 10.1016/j.tplants.2019.01.003 Machaj, G., Bostan, H., Macko-Podgórni, A., Iorizzo, M., and Grzebelus, D. (2018). Comparative transcriptomics of root development in wild and cultivated carrots. Genes (Basel). 9:431. doi: 10.3390/genes9090431 Maples, B. K., Gravel, S., Kenny, E. E., and Bustamante, C. D. (2013). RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288. doi: 10.1016/j.ajhg.2013.06.020 Martin, S. H., Davey, J. W., and Jiggins, C. D. (2015). Evaluating the use of ABBA– BABA statistics to locate introgressed loci. Mol. Biol. Evol. 32, 244–257. doi: 10.1093/molbev/msu269 Mascher, M., Gundlach, H., Himmelbach, A., Beier, S., Twardziok, S. O., Wicker, T., et al. (2017). A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427–433. doi: 10.1038/nature22043 Mascher, M., Schuenemann, V. J., Davidovich, U., Marom, N., Himmelbach, A., Hübner, S., et al. (2016). Genomic analysis of 6,000-year-old cultivated grain illuminates the domestication history of barley. Nat. Genet. 48, 1089–1093. doi: 10.1038/ng.3611 Mastretta-Yanes, A., Arrigo, N., Alvarez, N., Jorgensen, T. H., Piñero, D., and Emerson, B. C. (2015). Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Mol. Ecol. Resour. 15, 28–41. doi: 10.1111/1755-0998. 12291 Mather, N., Traves, S. M., and Ho, S. Y. (2020). A practical introduction to sequentiallyMarkovian coalescent methods for estimating demographic history from genomic data. Ecol. Evol. 10, 579–589. doi: 10.1002/ece3.5888 Mathew, B., Léon, J., and Sillanpää, M. J. (2018). A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction. Heredity (Edinb). 120, 356–368. doi: 10.1038/s41437-017- 0023-4 Mazet, O., Rodríguez, W., and Chikhi, L. (2015). Demographic inference using genetic data from a single individual: separating population size variation from population structure. Theor. Popul. Biol. 104, 46–58. doi: 10.1016/j.tpb.2015. 06.003 Frontiers in Genetics | www.frontiersin.org 21 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication Meirmans, P. G. (2015). Seven common mistakes in population genetics and how to avoid them.Mol. Ecol. 24, 3223–3231. doi: 10.1111/mec.13243 Meissner, A. (2005). Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 33, 5868–5877. doi: 10.1093/nar/gki901 Meyer, R. S., Choi, J. Y., Sanches, M., Plessis, A., Flowers, J. M., Amas, J., et al. (2016). Domestication history and geographical adaptation inferred from a SNP map of African rice. Nat. Genet. 48, 1083–1088. doi: 10.1038/ng.3633 Meyer, R. S., and Purugganan, M. D. (2013). Evolution of crop species: genetics of domestication and diversification. Nat. Rev. Genet. 14, 840–852. doi: 10.1038/ nrg3605 Money, D., Migicovsky, Z., Gardner, K., and Myles, S. (2017). LinkImputeR: user-guided genotype calling and imputation for non-model organisms. BMC Genomics 18:523. doi: 10.1186/s12864-017-3873-5 Montenegro, J. D., Golicz, A. A., Bayer, P. E., Hurgobin, B., Lee, H., Chan, C. K. K., et al. (2017). The pangenome of hexaploid bread wheat. Plant Journal 90, 1007–1013. doi: 10.1111/tpj.13515 Moyers, B. T., Morrell, P. L., and McKay, J. K. (2018). Genetic costs of domestication and improvement. J. Hered. 109, 103–116. doi: 10.1093/jhered/ esx069 Muir, P., Li, S., Lou, S., Wang, D., Spakowicz, D. J., Salichos, L., et al. (2016). The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol. 17:53. doi: 10.1186/s13059-016-0917-0 Myers, C. L., Springer, N. M., Schaefer, R., Ross-Ibarra, J., Swanson-Wagner, R., Tiffin, P., et al. (2012). Reshaping of the maize transcriptome by domestication. Proc. Natl. Acad. Sci. U.S.A. 109, 11878–11883. doi: 10.1073/pnas.1201961109 Nadachowska-Brzyska, K., Burri, R., Smeds, L., and Ellegren, H. (2016). PSMC analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers.Mol. Ecol. 25, 1058–1072. doi: 10.1111/ mec.13540 Ndjiondjop, M. N., Alachiotis, N., Pavlidis, P., Goungoulou, A., Kpeki, S. B., Zhao, D., et al. (2019). Comparisons of molecular diversity indices, selective sweeps and population structure of African rice with its wild progenitor and Asian rice. Theor. Appl. Genet. 132, 1145–1158. doi: 10.1007/s00122-018-3268-2 Nei, M., and Maruyama, T. (1975). Lewontin-Krakauer test for neutral genes. Genetics 80:395. Nielsen, R., and Beaumont, M. A. (2009). Statistical inferences in phylogeography. Mol. Ecol. 18, 1034–1047. doi: 10.1111/j.1365-294X.2008.04059.x Nowoshilow, S., Schloissnig, S., Fei, J. F., Dahl, A., Pang, A. W., Pippel, M., et al. (2018). The axolotl genome and the evolution of key tissue formation regulators. Nature 554, 50–55. doi: 10.1038/nature25458 Ottoni, C., Van Neer, W., De Cupere, B., Daligault, J., Guimaraes, S., Peters, J., et al. (2017). The palaeogenetics of cat dispersal in the ancient world. Nat. Ecol. Evol. 1:0139. doi: 10.1038/s41559-017-0139 Pankin, A., Altmüller, J., Becker, C., and von Korff, M. (2018). Targeted resequencing reveals genomic signatures of barley domestication. New Phytol. 218, 1247–1259. doi: 10.1111/nph.15077 Paterson, A. H., Lander, E. S., Hewitt, J. D., Peterson, S., Lincoln, S. E., and Tanksley, S. D. (1988). Resolution of quantitative traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 335, 721–726. doi: 10.1038/335721a0 Patterson, N., Price, A. L., and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet. 2:e190. doi: 10.1371/journal.pgen.0020190 Pavlidis, P., and Alachiotis, N. (2017). A survey of methods and tools to detect recent and strong positive selection. J. Biol. Res. 24:7. doi: 10.1186/s40709-017- 0064-0 Pickrell, J. K., and Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8:e1002967. doi: 10.1371/journal.pgen.1002967 Piperno, D. R. (2017). Assessing elements of an extended evolutionary synthesis for plant domestication and agricultural origin research. Proc. Natl. Acad. Sci. U.S.A. 114, 6429–6437. doi: 10.1073/pnas.1703658114 Pitt, D., Sevane, N., Nicolazzi, E. L., MacHugh, D. E., Park, S. D. E., Colli, L., et al. (2019). Domestication of cattle: two or three events? Evol. Appl. 12, 123–136. doi: 10.1111/eva.12674 Prezeworski, M., Coop, G., and Wall, J. D. (2005). The signature of positive selection on standing genetic variation. Evolution 59, 2312–2323. doi: 10.1111/ j.0014-3820.2005.tb00941.x Price, A. L., Tandon, A., Patterson, N., Barnes, K. C., Rafaels, N., Ruczinski, I., et al. (2009). Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5:e1000519. doi: 10.1371/journal.pgen. 1000519 Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945–959. doi: 10.1111/ j.1471-8286.2007.01758.x Qi, J., Liu, X., Shen, D., Miao, H., Xie, B., Li, X., et al. (2013). A genomic variation map provides insights into the genetic basis of cucumber domestication and diversity. Nat. Genet. 45, 1510–1515. doi: 10.1038/ng.2801 Qiu, Q., Wang, L., Wang, K., Yang, Y., Ma, T., Wang, Z., et al. (2015). Yak whole-genome resequencing reveals domestication signatures and prehistoric population expansions. Nat. Commun. 6:10283. doi: 10.1038/ncomms10283 Raj, A., Stephens, M., and Pritchard, J. K. (2014). fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197, 573–589. doi: 10.1534/genetics.114.164350 Ramos-Madrigal, J., Smith, B. D., Moreno-Mayar, J. V., Gopalakrishnan, S., Ross- Ibarra, J., Gilbert, M. T. P., et al. (2016). Genome sequence of a 5,310-year-old maize cob provides insights into the early stages of maize domestication. Curr. Biol. 26, 3195–3201. doi: 10.1016/j.cub.2016.09.036 Ranwez, V., Serra, A., Pot, D., and Chantret, N. (2017). Domestication reduces alternative splicing expression variations in sorghum. PLoS One 12:e0183454. doi: 10.1371/journal.pone.0183454 Renaut, S., and Rieseberg, L. H. (2015). The accumulation of deleterious mutations as a consequence of domestication and improvement in sunflowers and other compositae crops.Mol. Biol. Evol. 32, 2273–2283. doi: 10.1093/molbev/msv106 Rodríguez-Mega, E., Piñeyro-Nelson, A., Gutierrez, C., García-Ponce, B., Sánchez, M. D. L. P., Zluhan-Martínez, E., et al. (2015). Role of transcriptional regulation in the evolution of plant phenotype: a dynamic systems approach. Dev. Dyn. 244, 1074–1095. doi: 10.1002/dvdy.24268 Ross-Ibarra, J., Morrell, P. L., and Gaut, B. S. (2007). Plant domestication, a unique opportunity to identify the genetic basis of adaptation. Proc. Natl. Acad. Sci. U.S.A. 104, 8641–8648. doi: 10.1073/pnas.0700643104 Rozas, J., Ferrer-Mata, A., Sánchez-DelBarrio, J. C., Guirao-Rico, S., Librado, P., Ramos-Onsins, S. E., et al. (2017). DnaSP 6: DNA sequence polymorphism analysis of large data sets.Mol. Biol. Evol. 34, 3299–3302. doi: 10.1093/molbev/ msx248 Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z. P., Richter, D. J., Schaffner, S. F., et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837. doi: 10.1038/nature01140 Sabeti, P. C., Varilly, P., Fry, B., Lohmueller, J., Hostetter, E., Cotsapas, C., et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918. doi: 10.1038/nature06250 Sakurada, K. (2010). Environmental epigenetic modifications and reprogramming- recalcitrant genes. Stem Cell Res. 4, 157–164. doi: 10.1016/j.scr.2010.01.001 Sawyer, S., Krause, J., Guschanski, K., Savolainen, V., and Pääbo, S. (2012). Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA. PLoS One 7:e34131. doi: 10.1371/journal.pone.0034131 Sax, K. (1923). The association of size differences with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8, 552–560. Schiffels, S., andDurbin, R. (2014). Inferring human population size and separation history frommultiple genome sequences.Nat. Genet. 46, 919–925. doi: 10.1038/ ng.3015 Schlötterer, C., Tobler, R., Kofler, R., and Nolte, V. (2014). Sequencing pools of individuals – mining genome-wide polymorphism data without big funding. Nat. Rev. Genet. 15, 749–763. doi: 10.1038/nrg3803 Schmidt, D.,Wilson, M. D., Spyrou, C., Brown, G. D., Hadfield, J., and Odom, D. T. (2009). ChIP-seq: using high-throughput sequencing to discover protein–DNA interactions.Methods 48, 240–248. doi: 10.1016/j.ymeth.2009.03.001 Schmitz, R. J., Schultz, M. D., Urich, M. A., Nery, J. R., Pelizzola, M., Libiger, O., et al. (2013). Patterns of population epigenomic diversity. Nature 495, 193–198. doi: 10.1038/nature11968 Schmutz, J., McClean, P. E., Mamidi, S., Wu, G. A., Cannon, S. B., Grimwood, J., et al. (2014). A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet. 46, 707–713. doi: 10.1038/ng.3008 Schnable, P. S., Ware, D., Fulton, R. S., Stein, J. C., Wei, F., Pasternak, S., et al. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115. doi: 10.1126/science.1178534 Frontiers in Genetics | www.frontiersin.org 22 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication Schreiber, M., Stein, N., andMascher, M. (2018). Genomic approaches for studying crop evolution. Genome Biol. 19:140. doi: 10.1186/s13059-018-1528-8 Schubert, M., Jónsson, H., Chang, D., Der Sarkissian, C., Ermini, L., Ginolhac, A., et al. (2014). Prehistoric genomes reveal the genetic foundation and cost of horse domestication. Proc. Natl. Acad. Sci. U.S.A. 111, E5661–E5669. doi: 10.1073/pnas.1416991111 Schurch, N. J., Schofield, P., Gierliñski, M., Cole, C., Sherstnev, A., Singh, V., et al. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22, 839–851. doi: 10.1261/rna.053959.115 Schweiger, W., Boddu, J., Shin, S., Poppenberger, B., Berthiller, F., Lemmens, M., et al. (2010). Validation of a candidate deoxynivalenol-inactivating UDP- glucosyltransferase from barley by heterologous expression in yeast.Mol. Plant Microbe Interact. 23, 977–986. doi: 10.1094/MPMI-23-7-0977 Seal, A., Gupta, A., Mahalaxmi, M., Aykkal, R., Singh, T. R., and Arunachalam, V. (2014). Tools, resources and databases for SNPs and indels in sequences: a review. Int. J. Bioinform. Res. Appl. 10:264. doi: 10.1504/IJBRA.2014.060762 Shafer, A. B. A., Peart, C. R., Tusso, S., Maayan, I., Brelsford, A., Wheat, C. W., et al. (2017). Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol. Evol. 8, 907–917. doi: 10.1111/2041-210X.12700 Shalem, O., Sanjana, N. E., Hartenian, E., Shi, X., Scott, D. A., Mikkelsen, T. S., et al. (2014). Genome-scale CRISPR-Cas9 knockout screening in human cells. Science 343, 84–87. doi: 10.1126/science.1247005 Shan, S., Soltis, P. S., Soltis, D. E., and Yang, B. (2020). Considerations in adapting CRISPR/Cas9 in nongenetic model plant systems. Appl. Plant Sci. 8:e11314. doi: 10.1002/aps3.11314 Shang, Y., Ma, Y., Zhou, Y., Zhang, H., Duan, L., Chen, H., et al. (2014). Biosynthesis, regulation, and domestication of bitterness in cucumber. Science 346, 1084–1088. doi: 10.1126/science.1259215 Shen, Y., Zhang, J., Liu, Y., Liu, S., Liu, Z., Duan, Z., et al. (2018). DNAmethylation footprints during soybean domestication and improvement. Genome Biol. 19:128. doi: 10.1186/s13059-018-1516-z Shi, J., and Lai, J. (2015). Patterns of genomic changes with crop domestication and breeding. Curr. Opin. Plant Biol. 24C, 47–53. doi: 10.1016/j.pbi.2015.01.008 Sims, D., Sudbery, I., Ilott, N. E., Heger, A., and Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132. doi: 10.1038/nrg3642 Singh, J., Zhao, J., and Vallejos, C. E. (2018). Differential transcriptome patterns associated with early seedling development in a wild and a domesticated common bean (Phaseolus vulgaris L.) accession. Plant Sci. 274, 153–162. doi: 10.1016/j.plantsci.2018.05.024 Smith, J. M., and Haigh, J. (2007). The hitch-hiking effect of a favourable gene. Genet. Res. (Camb) 89, 391–403. doi: 10.1017/S0016672308009579 Smith, O., Nicholson, W. V., Kistler, L., Mace, E., Clapham, A., Rose, P., et al. (2019). A domestication history of dynamic adaptation and genomic deterioration in Sorghum. Nat. Plants 5, 369–379. doi: 10.1038/s41477-019- 0397-9 Sohn, J., and Nam, J.-W. (2016). The present and future of de novo whole-genome assembly. Brief. Bioinform. 19:bbw096. doi: 10.1093/bib/bbw096 Song, J., Li, J., Sun, J., Hu, T., Wu, A., Liu, S., et al. (2018). Genome-wide association mapping for cold tolerance in a core collection of rice (Oryza sativa L.) landraces by using high-density single nucleotide polymorphism markers from specific-locus amplified fragment sequencing. Front. Plant Sci. 9:875. doi: 10.3389/fpls.2018.00875 Song, Q., Zhang, T., Stelly, D. M., and Chen, Z. J. (2017). Epigenomic and functional analyses reveal roles of epialleles in the loss of photoperiod sensitivity during domestication of allotetraploid cottons. Genome Biol. 18:99. doi: 10. 1186/s13059-017-1229-8 Soyk, S., Lemmon, Z. H., Oved, M., Fisher, J., Liberatore, K. L., Park, S. J., et al. (2017). Bypassing negative epistasis on yield in tomato imposed by a domestication gene. Cell 169, 1142–1155. doi: 10.1016/j.cell.2017.04.032 Sun, H., Wu, S., Zhang, G., Jiao, C., Guo, S., Ren, Y., et al. (2017). Karyotype stability and unbiased fractionation in the paleo-allotetraploid cucurbita genomes.Mol. Plant 10, 1293–1306. doi: 10.1016/j.molp.2017.09.003 Swinnen, G., Goossens, A., and Pauwels, L. (2016). Lessons from domestication: targeting Cis -regulatory elements for crop improvement. Trends Plant Sci. 21, 506–515. doi: 10.1016/j.tplants.2016.01.014 Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595. Tang, H., Coram, M., Wang, P., Zhu, X., and Risch, N. (2006). Reconstructing genetic ancestry blocks in admixed individuals. Am. J. Hum. Genet. 79, 1–12. doi: 10.1086/504302 Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome.”. Proc. Natl. Acad. Sci. U.S.A. 102, 13950–13955. doi: 10.1073/pnas.0506758102 Thornton, T. A., and Bermejo, J. L. (2014). Local and global ancestry inference and applications to genetic association analysis for admixed populations.Genet. Epidemiol. 38, S5–S12. doi: 10.1002/gepi.21819 Tiffin, P., and Ross-Ibarra, J. (2014). Advances and limits of using population genetics to understand local adaptation. Trends Ecol. Evol. 29, 673–680. doi: 10.1016/j.tree.2014.10.004 Tomato Genome Consortium (2012). The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485:635. doi: 10.1038/nature11119 Trerotola, M., Relli, V., Simeone, P., and Alberti, S. (2015). Epigenetic inheritance and the missing heritability. Hum. Genomics 9:17. doi: 10.1186/s40246-015- 0041-3 Trucchi, E., Gratton, P., Whittington, J. D., Cristofari, R., Le Maho, Y., Stenseth, N. C., et al. (2014). King penguin demography since the last glaciation inferred from genome-wide data. Proc. R. Soc. B Biol. Sci. 281:20140528. doi: 10.1098/ rspb.2014.0528 Turner, B. M. (2009). Epigenetic responses to environmental change and their evolutionary implications. Philos. Trans. R. Soc. Lond. B Biol. Sci. 364, 3403– 3418. doi: 10.1098/rstb.2009.0125 Ueta, R., Abe, C., Watanabe, T., Sugano, S. S., Ishihara, R., Ezura, H., et al. (2017). Rapid breeding of parthenocarpic tomato plants using CRISPR/Cas9. Sci. Rep. 7:507. doi: 10.1038/s41598-017-00501-4 Urich, M. A., Nery, J. R., Lister, R., Schmitz, R. J., and Ecker, J. R. (2015). MethylC- seq library preparation for base-resolution whole-genome bisulfite sequencing. Nat. Protoc. 10:475. doi: 10.1038/nprot.2014.114 Vallebueno-Estrada, M., Rodríguez-Arévalo, I., Rougon-Cardoso, A., Martínez González, J., García Cook, A., Montiel, R., et al. (2016). The earliest maize from San Marcos Tehuacán is a partial domesticate with genomic evidence of inbreeding. Proc. Natl. Acad. Sci. U.S.A. 113, 14151–14156. doi: 10.1073/pnas. 1609701113 VanBuren, R., Wai, C. M., Colle, M., Wang, J., Sullivan, S., Bushakra, J. M., et al. (2018). A near complete, chromosome-scale assembly of the black raspberry (Rubus occidentalis) genome. Gigascience 7, 1–9. doi: 10.1093/gigascience/ giy094 Varshney, R. K., Thudi, M., Roorkiwal, M., He, W., Upadhyaya, H. D., Yang, W., et al. (2019). Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nat. Genet. 51, 857–864. doi: 10.1038/s41588-019-0401-3 Vasemägi, A., Nilsson, J., McGinnity, P., Cross, T., O’Reilly, P., Glebe, B., et al. (2012). Screen for footprints of selection during domestication/captive breeding of atlantic salmon. Comp. Funct. Genomics 2012, 1–14. doi: 10.1155/2012/ 628204 Velasco, D., Hough, J., Aradhya, M., and Ross-Ibarra, J. (2016). Evolutionary genomics of peach and almond domestication. G3 6, 3985–3993. doi: 10.1534/ g3.116.032672 Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., et al. (2017). 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22. doi: 10.1016/j.ajhg.2017.06.005 Vitti, J. J., Grossman, S. R., and Sabeti, P. C. (2013). Detecting natural selection in genomic data. Annu. Rev. Genet. 47, 97–120. doi: 10.1146/annurev-genet- 111212-133526 Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS Biol. 4:e72. doi: 10.1371/journal. pbio.0040072 Wales, N., Akman, M., Watson, R. H. B., Sánchez Barreiro, F., Smith, B. D., Gremillion, K. J., et al. (2019). Ancient DNA reveals the timing and persistence of organellar genetic bottlenecks over 3,000 years of sunflower domestication and improvement. Evol. Appl. 12, 38–53. doi: 10.1111/eva.12594 Wales, N., Ramos Madrigal, J., Cappellini, E., Carmona Baez, A., Samaniego Castruita, J. A., Romero-Navarro, J. A., et al. (2016). The limits and potential Frontiers in Genetics | www.frontiersin.org 23 July 2020 | Volume 11 | Article 742 Barrera-Redondo et al. Genomic Tools to Study Domestication of paleogenomic techniques for reconstructing grapevine domestication. J. Archaeol. Sci. 72, 57–70. doi: 10.1016/j.jas.2016.05.014 Walley, J. W., Sartor, R. C., Shen, Z., Schmitz, R. J., Wu, K. J., Urich, M. A., et al. (2016). Integration of omic networks in a developmental atlas of maize. Science 353, 814–818. doi: 10.1126/science.aag1125 Wang, E. T., Kodama, G., Baldi, P., and Moyzis, R. K. (2006). Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc. Natl. Acad. Sci. U.S.A. 103, 135–140. doi: 10.1073/pnas.0509691102 Wang, G.-D., Xie, H.-B., Peng, M.-S., Irwin, D., and Zhang, Y.-P. (2014). Domestication genomics: evidence from animals. Annu. Rev. Anim. Biosci. 2, 65–84. doi: 10.1146/annurev-animal-022513-114129 Wang, M., Yan, J., Zhao, J., Song, W., Zhang, X., Xiao, Y., et al. (2012). Genome- wide association study (GWAS) of resistance to head smut in maize. Plant Sci. 196, 125–131. doi: 10.1016/j.plantsci.2012.08.004 Wang, W., Feng, B., Xiao, J., Xia, Z., Zhou, X., Li, P., et al. (2014). Cassava genome from a wild ancestor to cultivated varieties. Nat. Commun. 5:5110. doi: 10.1038/ncomms6110 Wang, W., Zhang, X., Zhou, X., Zhang, Y., La, Y., Zhang, Y., et al. (2019). Deep genome resequencing reveals artificial and natural selection for visual deterioration, plateau adaptability and high prolificacy in chinese domestic sheep. Front. Genet. 10:300. doi: 10.3389/fgene.2019.00300 Warr, A., Robert, C., Hume, D., Archibald, A., Deeb, N., and Watson, M. (2015). Exome sequencing: current and future perspectives. G3 5, 1543–1550. doi: 10.1534/g3.115.018564 Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., et al. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat. Genet. 37, 853–862. doi: 10.1038/ng1598 Weigel, D., and Colot, V. (2012). Epialleles in plant evolution. Genome Biol. 13, 249. doi: 10.1186/gb-2012-13-10-249 Wiener, P., and Pong-Wong, R. (2011). A regression-based approach to selection mapping. J. Hered. 102, 294–305. doi: 10.1093/jhered/esr014 Wigginton, J. E., Cutler, D. J., and Abecasis, G. R. (2005). A note on exact tests of hardy-weinberg equilibrium. Am. J. Hum. Genet. 76, 887–893. doi: 10.1086/ 429864 Wilkinson, S., Lu, Z. H., Megens, H.-J., Archibald, A. L., Haley, C., Jackson, I. J., et al. (2013). Signatures of diversifying selection in european pig breeds. PLoS Genet. 9:e1003453. doi: 10.1371/journal.pgen.1003453 Wolter, F., Schindele, P., and Puchta, H. (2019). Plant breeding at the speed of light: the power of CRISPR/Cas to generate directed genetic diversity at multiple sites. BMC Plant Biol. 19:176. doi: 10.1186/s12870-019-1775-1 Wu, D.-D., Ding, X.-D., Wang, S., Wójcik, J. M., Zhang, Y., Tokarska, M., et al. (2018). Pervasive introgression facilitated domestication and adaptation in the Bos species complex. Nat. Ecol. Evol. 2, 1139–1145. doi: 10.1038/s41559-018- 0562-y Xie, M., Chung, C. Y.-L., Li, M., Wong, F.-L., Wang, X., Liu, A., et al. (2019). A reference-grade wild soybean genome. Nat. Commun. 10:1216. doi: 10.1038/ s41467-019-09142-9 Yandell, M., and Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13, 329–342. doi: 10.1038/nrg3174 Yang, C. J., Samayoa, L. F., Bradbury, P. J., Olukolu, B. A., Xue, W., York, A. M., et al. (2019). The genetic architecture of teosinte catalyzed and constrained maize domestication. Proc. Natl. Acad. Sci. U.S.A. 116, 5643–5652. doi: 10.1073/ pnas.1820997116 Yang, I. S., and Kim, S. (2015). Analysis of whole transcriptome sequencing data: workflow and software. Genomics Inform. 13:119. doi: 10.5808/GI.2015.13.4.119 Yang, L., Koo, D. H., Li, Y., Zhang, X., Luan, F., Havey, M. J., et al. (2012). Chromosome rearrangements during domestication of cucumber as revealed by high-density genetic mapping and draft genome assembly. Plant J. 71, 895–906. doi: 10.1111/j.1365-313X.2012.05017.x Yang, X., Zhang, H., Shang, J., Liu, G., Xia, T., Zhao, C., et al. (2018). Comparative analysis of the blood transcriptomes between wolves and dogs.Anim. Genet. 49, 291–302. doi: 10.1111/age.12675 Zadesenets, K. S., and Rubtsov, N. B. (2018). Genome duplication in animal evolution. Russ. J. Genet. 54, 1125–1136. doi: 10.1134/S10227954180 90168 Zeder, M. A. (2006). Central questions in the domestication of plants and animals. Evol. Anthropol. Issues News Rev. 15, 105–117. doi: 10.1002/evan.20101 Zeder, M. A., Emshwiller, E., Smith, B. D., and Bradley, D. G. (2006). Documenting domestication: the intersection of genetics and archaeology. Trends Genet. 22, 139–155. doi: 10.1016/j.tig.2006.01.007 Zeng, K., Fu, Y.-X., Shi, S., and Wu, C.-I. (2006). Statistical tests for detecting positive selection by utilizing high-frequency variants.Genetics 174, 1431–1439. doi: 10.1534/genetics.106.061432 Zeng, L., Tu, X.-L., Dai, H., Han, F.-M., Lu, B.-S., Wang, M.-S., et al. (2019). Whole genomes and transcriptomes reveal adaptation and domestication of pistachio. Genome Biol. 20:79. doi: 10.1186/s13059-019-1686-3 Zhang, C., Bailey, D. K., Awad, T., Liu, G., Xing, G., Cao, M., et al. (2006). A whole genome long-range haplotype (WGLRH) test for detecting imprints of positive selection in human populations. Bioinformatics 22, 2122–2128. doi: 10.1093/bioinformatics/btl365 Zhang, H., Zhang, J., Lang, Z., Botella, J. R., and Zhu, J.-K. (2017). Genome editing—principles and applications for functional genomics research and crop improvement. CRC. Crit. Rev. Plant Sci. 36, 291–309. doi: 10.1080/07352689. 2017.1402989 Zhao, B. S., Roundtree, I. A., and He, C. (2017). Post-transcriptional gene regulation by mRNA modifications. Nat. Rev. Mol. Cell. Biol. 18, 31–42. doi: 10.1038/nrm.2016.132 Zhao, Q., Feng, Q., Lu, H., Li, Y., Wang, A., Tian, Q., et al. (2018). Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat. Genet. 50, 278–284. doi: 10.1038/s41588-018-0041-z Zheng, J., Wu, H., Zhu, H., Huang, C., Liu, C., Chang, Y., et al. (2019). Determining factors, regulation system, and domestication of anthocyanin biosynthesis in rice leaves. New Phytol. 223, 705–721. doi: 10.1111/nph. 15807 Zhou, J., Li, D., Wang, G., Wang, F., Kunjal, M., Joldersma, D., et al. (2019). Application and future perspective of CRISPR/Cas9 genome editing in fruit crops. J. Integr. Plant Biol. 62, 269–286. doi: 10.1111/jipb. 12793 Zhou, Y., Minio, A., Massonnet, M., Solares, E., Lv, Y., Beridze, T., et al. (2019). The population genetics of structural variants in grapevine domestication. Nat. Plants 5, 965–979. doi: 10.1038/s41477-019-0507-8 Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Copyright © 2020 Barrera-Redondo, Piñero and Eguiarte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. Frontiers in Genetics | www.frontiersin.org 24 July 2020 | Volume 11 | Article 742 42 CAPÍTULO 2: ENSAMBLE DEL GENOMA DE REFERENCIA DE Cucurbita argyrosperma Y DINÁMICA EVOLUTIVA DE LAS FAMILIAS GÉNICAS CODIFICANTES Y NO CODIFICANTES EN EL GÉNERO Cucurbita Artículo de investigación: The genome of Cucurbita argyrosperma (silver-seed gourd) reveals faster rates of protein-coding gene and long noncoding RNA turnover and neofunctionalization within Cucurbita Este capítulo trata sobre el ensamble del genoma de referencia de la calabaza domesticada Cucurbita argyrosperma subsp. argyrosperma y su uso para entender el ritmo y modo en que evolucionan los genes codificantes, los ARNs largos intergénicos no codificantes y los pseudogenes (i.e., genes codificantes que perdieron su marco abierto de lectura, pero siguen siendo transcripcionalmente activos) después de una duplicación completa del genoma que había sido previamente descrita en las calabazas. El genoma de C. argyrosperma se generó usando las tecnologías de secuenciación de Illumina y PacBio, obteniendo un ensamble de alta calidad. También se secuenció ARN de cinco órganos distintos para poder predecir los genes codificantes y no codificantes en el genoma de C. argyrosperma. Mediante un análisis de genómica comparada, se encontró que la tasa de recambio de genes codificantes y ARNs largos no codificantes es más rápida dentro de los genomas de calabazas que en los demás genomas previamente reportados para la familia Cucurbitaceae, posiblemente como resultado de la redundancia funcional de la duplicación completa del genoma que facilitó la co-opción y reemplazo de genes. Finalmente, se encontró que algunos de los genes que perdieron su marco abierto de lectura se mantienen conservados estructural y filogenéticamente en los genomas de calabazas, además de que siguen siendo transcripcionalmente activos, sugiriendo que estos elementos pudieran ser funcionales, posiblemente como elementos regulatorios. Este capítulo fue publicado en la revista Molecular Plant en el número del mes de abril del 2019. The Genome of Cucurbita argyrosperma (Silver- Seed Gourd) Reveals Faster Rates of Protein- Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita Josué Barrera-Redondo1, Enrique Ibarra-Laclette2, Alejandra Vázquez-Lobo3, Yocelyn T. Gutiérrez-Guerrero1, Guillermo Sánchez de la Vega1, Daniel Piñero1, Salvador Montes-Hernández4, Rafael Lira-Saade5,* and Luis E. Eguiarte1,* 1Departamento de Ecologı́a Evolutiva, Instituto de Ecologı́a, Universidad Nacional Autónoma de México, Circuito Exterior s/n Anexo al Jardı́n Botánico, 04510 Ciudad de México, Mexico 2Departamento de Estudios Moleculares Avanzados, Instituto de Ecologı́a A.C., Carretera Antigua a Coatepec No. 351, Col. El Haya. C.P., Xalapa, Veracruz 91070, Mexico 3Centro de Investigaciones en Biodiversidad y Conservación, Universidad Autónoma del Estado de Morelos, Av. Universidad 1001, Col. Chamilpa, Cuernavaca, Morelos 62209, Mexico 4Campo Experimental Bajı́o, Instituto Nacional de Investigaciones Forestales, Agrı́colas y Pecuarias (INIFAP), Km 6.5 Carretera Celaya-San Miguel de Allende, Celaya, Guanajuato 38110, Mexico 5UBIPRO, Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México, Av. de los Barrios #1, Col. Los Reyes Iztacala, Tlanepantla, Edo. de Mex 54090, Mexico *Correspondence: Rafael Lira-Saade (rlira@unam.mx), Luis E. Eguiarte (fruns@unam.mx) https://doi.org/10.1016/j.molp.2018.12.023 ABSTRACT Whole-genome duplications are an important source of evolutionary novelties that change the mode and tempo at which genetic elements evolve within a genome. The Cucurbita genus experienced a whole- genome duplication around 30 million years ago, although the evolutionary dynamics of the coding and noncoding genes in this genus have not yet been scrutinized. Here, we analyzed the genomes of four Cucurbita species, including a newly assembled genome of Cucurbita argyrosperma, and compared the gene contents of these species with those of five other members of the Cucurbitaceae family to assess the evolutionary dynamics of protein-coding and long intergenic noncoding RNA (lincRNA) genes after the genome duplication. We report that Cucurbita genomes have a higher protein-coding gene birth–death rate compared with the genomes of the other members of the Cucurbitaceae family. C. argyr- osperma gene families associated with pollination and transmembrane transport had significantly faster evolutionary rates. lincRNA families showed high levels of gene turnover throughout the phylogeny, and 67.7% of the lincRNA families in Cucurbita showed evidence of birth from the neofunctionalization of pre- viously existing protein-coding genes. Collectively, our results suggest that the whole-genome duplica- tion in Cucurbita resulted in faster rates of gene family evolution through the neofunctionalization of duplicated genes. Key words:Cucurbita argyrosperma, comparative genomics, molecular evolution, neofunctionalization, long non- coding RNA, whole-genome duplication Barrera-Redondo J., Ibarra-Laclette E., Vázquez-Lobo A., Gutiérrez-Guerrero Y.T., Sánchez de la Vega G., Piñero D., Montes-Hernández S., Lira-Saade R., and Eguiarte L.E. (2019). The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita. Mol. Plant. 12, 506–520. Published by the Molecular Plant Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and IPPE, SIBS, CAS. 506 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant Research Article INTRODUCTION Cucurbita is a genus with global agronomic relevance (Lira et al., 2016; Paris, 2016) and is one of the angiosperm genera with the highest number of independent domestication events (Nee, 1990; Zheng et al., 2013; Castellanos-Morales et al., 2018). Recent advances in the study of Cucurbita spp. genomes revealed a recent whole-genome duplication around 30 million years ago (Mya) in the common ancestor of the genus (Montero-Pau et al., 2017; Sun et al., 2017). Genome duplications are important sources of evolutionary nov- elties in plants, since redundant elements in a genome can develop novel functions in a process called neofunctionalization (Ganfornina and Sánchez, 1999; Magadum et al., 2013). It is expected that a lineage that experienced a recent whole- genome duplication would have different gene evolution dynamics compared with other closely related species with non-duplicated genomes (Ponting et al., 2009; Magadum et al., 2013). Even though the genomic footprints of a whole-genome duplica- tion are strong in Cucurbita, the numbers of predicted protein- coding genes in Cucurbita genomes are roughly similar to those in other genomes of the Cucurbitaceae family (Huang et al., 2009; Garcia-Mas et al., 2012; Guo et al., 2012; Montero-Pau et al., 2017; Sun et al., 2017; Urasaki et al., 2017; Wu et al., 2017). This apparent lack of duplicated coding genes could be the result of ‘‘pseudogenization’’ processes, implying a loss of redundant genes throughout the evolution of the Cucurbita genomes, either by the accumulation of mutations resulting in loss of function or fractionation due to intrachromosomal recombination (Sun et al., 2017). However, duplicated coding genes can also evolve to perform novel functions through positive selection (Wang et al., 2015). These novel functions are not necessarily limited to the emergence of new protein-coding elements, since protein-coding genes can also evolve into regu- latory elements as noncoding RNAs (Chen and Rajewsky, 2007). Most of the transcriptional activity in eukaryotes generates noncoding RNAs (Smith and Mattick, 2017), whose abundance correlates positively with organismal complexity, whereas the abundance of protein-coding genes does not scale with complexity (Liu et al., 2013). A particular category of noncoding transcripts called long noncoding RNAs (lncRNAs) seem to play critical roles in eukaryotic development and differentiation (Smith and Mattick, 2017). lncRNAs are a heterogeneous group of noncoding RNAs larger than 200 nucleotides that lack coding potential (Mercer et al., 2009; Ulitsky, 2016). lncRNAs act as master regulatory genes, mainly through the recruitment of chromatin modifiers in the nucleus, such as DNA methyltransferases and histone posttranslational modifiers (Fatica and Bozzoni, 2014; Smith and Mattick, 2017), although lncRNAs can also act in the cytoplasm through sequence complementarity to other RNA molecules and the modulation of mRNA stability (Fatica and Bozzoni, 2014). In plants, lncRNAs are involved in several biological functions, including organ development, flowering and vernalization, phosphate homeostasis, photomorphogenesis, response to biotic and abiotic stress conditions such as heat stress and response to phytopathogens, alternative splicing of protein- coding genes, nodule formation, and cell-wall synthesis (Chekanova, 2015; Liu et al., 2015). Studies regarding the evolutionary dynamics of lncRNAs are limited despite their evident importance in plant biology, due to a traditional focus on protein-coding genes in genome-wide studies (Ulitsky, 2016; Nelson et al., 2017). Furthermore, evolutionary analyses of these genes have been limited due to a lack of conservation at both the sequence level and the secondary structure level (Ulitsky, 2016), and research on their origin and evolution is still scarce (see Necsulea et al., 2014; Nelson et al., 2016; Zhao et al., 2018). Several hypotheses have been proposed to explain the emergence of new lncRNAs, such as the neofunctionalization of duplicated protein-coding genes, co-option of transposable elements in the genome, dupli- cation followed by neofunctionalization from other lncRNAs, and de novo emergence (Kapusta et al., 2013). Previous studies have suggested that whole-genome duplications can lead to faster rates of evolution in lncRNA families (Ponting et al., 2009; Nelson and Shippen, 2015). This study focuses on the effects of the Cucurbita-wide genome duplication on the evolutionary dynamics of both protein-coding and long intergenic noncoding RNA (lincRNA) genes in theCucur- bita genus. We propose that the Cucurbita genomes have faster gene evolutionary dynamics, including higher rates of gene birth and death, than the genomes of other members of the Cucurbita- ceae family due to this whole-genome duplication. We expected that species belonging to the Cucurbita genus might have expe- rienced several recent lincRNA birth events due to the whole- genome duplication (Ponting et al., 2009; Nelson and Shippen, 2015). We also analyzed the possibility that the duplication of protein-coding genes and posterior neofunctionalization may be a source of new lincRNA genes in Cucurbita (Kapusta et al., 2013). We explored these hypotheses by analyzing four Cucurbita genomes, including our novel genome assembly of Cucurbita argyrosperma ssp. argyrosperma, commonly known as cushaw or silver-seed gourd in English and ‘‘calabaza pipiana’’ in Spanish (Lira et al., 2016), by comparing the coding and noncoding genes in these genomes with those in other genome assemblies reported for the Cucurbitaceae family. RESULTS Cucurbita argyrosperma Genome and Transcriptome We sequenced the genome of C. argyrosperma ssp. argyro- sperma using three sequencing platforms: Illumina HiSeq2000, Illumina MiSeq, and PacBio RS II (see Supplemental Methods for detailed information on the genome and transcriptome sequencing). We obtained 38.4 Gb of data from HiSeq2000, 13.1 Gb from MiSeq, and 11.4 Gb from PacBio RS II. After applying quality filters to the data (see Supplemental Methods for detailed parameters) and filtering organelle reads, we obtained 1203 high-quality sequence coverage with the Illumina reads and 313 coverage with the PacBio reads. We estimated the genome size of C. argyrosperma to be ca. 238 Mb using KmerGenie (Chikhi and Medvedev, 2014). The chloroplast genome was assembled into a single circular contig of 157 623 bp and had the typical structures of a Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 507 The Genome of Cucurbita argyrosperma Molecular Plant chloroplast genome: a large single-copy region, a small single- copy region, and two inverted repeats (Daniell et al., 2016). The mitochondrial genome was assembled into 17 scaffolds composed of 1 062 053 bp and showed several instances of chloroplast sequence insertions, as previously described for the mitochondrial genome of Cucurbita pepo (Alverson et al., 2010). We assembled the C. argyrosperma nuclear genome into 920 scaffolds (1481 contigs), with an N50 of 620 880 bp (Table 1). The total length of the assembled scaffolds was 229 Mbp, around 96% of the estimated size of the genome and similar to the assembly size of previously reported Cucurbita genomes (Montero-Pau et al., 2017; Sun et al., 2017). Genome completeness was assessed by finding single-copy orthologous genes conserved in embryophytes (1440) using BUSCO (Simão et al., 2015). We found complete sequences for 93.2% (1342) of the BUSCO genes and fragmented sequences for 0.9% (13) of the BUSCO genes within the C. argyrosperma genome assembly, suggesting a high level of assembly completeness. We also found that 80.5% of the Illumina reads and 100% of the PacBio reads used for genome assembly mapped against the assembled scaffolds, indicating that most of the sequenced genome is present in the nuclear assembly. We sequenced the C. argyrosperma transcriptome using Illumina HiSeq2000, obtaining 51 Gb of RNA sequencing (RNA-seq) data. Wemapped 90.69%of the transcriptome reads back to either the nuclear or the organelle assemblies, indicating a high level of genome completeness. The transcriptome was assembled both de novo (Grabherr et al., 2011) and using a genome-guided as- sembly (Pertea et al., 2015) to aid in the prediction of gene models. We predicted 4387 tRNAs and 28 298 protein-coding genes within the genome assembly, numbers similar to those re- ported in C. pepo, Cucumis sativus, and Cucumis melo (Huang et al., 2009; Garcia-Mas et al., 2012; Montero-Pau et al., 2017) (Table 1). 78.9% of the protein-coding genes were functionally annotated using InterProScan (Jones et al., 2014; Supplemental Data 4), where 51.7% of the genes could be assigned with at least one gene ontology (GO) term (Ashburner et al., 2000; The Gene Ontology Consortium, 2017). We predicted78Mbp of transposable elements (TEs) within the C.argyrospermagenome, corresponding to 34.1%of thegenome assembly. This proportion of TEs is similar to those found within the genomes of C. pepo (93 Mbp, 37.8% of the genome assem- bly), Cucurbita maxima (107 Mbp, 40.3%), and Cucurbita mo- schata (106 Mbp, 40.6%) (Montero-Pau et al., 2017; Sun et al., 2017). Of the 78 Mbp of TEs, 93.6% correspond to RNA transposons, with most being LTR retrotransposons (49.09%) and LARD retrotransposons (29.36%). Just 1.95% of the observed TEs correspond to DNA transposons, and 4.44% correspond to unidentifiable TEs (Supplemental Table 1). The dominance of LTR retrotransposons within the genome of C. argyrosperma is similar to that in C. pepo (50.7%) (Montero-Pau et al., 2017), C. moschata (62.9%), and C. maxima (69.9%) (Sun et al., 2017), revealing that TE families remained relatively stable within the Cucurbita genus. Phylogeny and Evolution of Protein-Coding Gene Families We compared the protein-coding genes of C. argyrosperma with those of C. pepo (Montero-Pau et al., 2017), C. moschata, and C. maxima (Sun et al., 2017); as well as other genera in the Cucurbitaceae family, C. sativus (Huang et al., 2009), C. melo (Garcia-Mas et al., 2012), Citrullus lanatus (Levi et al., 2011), Lagenaria siceraria (Wu et al., 2017), and Momordica charantia (Urasaki et al., 2017), to assess protein-coding gene family ex- pansions and contractions within the Cucurbitaceae family, as well as within the Cucurbita genus. We used Fragaria vesca (Edger et al., 2018) and Juglans regia (Martı́nez-Garcı́a et al., 2016) as outgroups. We retrieved 23 247 protein-coding gene families, of which only 11 961 families included at least one homolog conserved in two or more different species; the remaining families were exclusive to a single species (Supplemental Figure 1). We found 698 gene families that remained multicopy within Cucurbita after the whole-genome duplication, and these families were functionally enriched (p < 0.01) in intracellular protein transport. We also found 858 gene families that remained a constant size throughout Cucurbita and the other Cucurbitaceae species, although we found no functional enrichment within these families. We identi- fied 369 gene families as single-copy orthologs conserved in all Cucurbitaceae and outgroup species, which were used to obtain a time-calibrated phylogeny needed for gene family evolution analyses. The resulting species tree had approximate likelihood-ratio test (Ansimova and Gascuel, 2006) support Assembly size 228 814 150 bp No. of scaffolds 920 Longest scaffold 2 746 581 bp N50 of scaffolds 620 880 bp L50 of scaffolds 103 No. of scaffolds >1 kbp 920 (100.0%) No. of scaffolds >10 kbp 903 (98.2%) No. of scaffolds >100 kbp 455 (49.5%) No. of contigs 1481 Longest contig 2 172 140 bp N50 of contigs 463 388 bp L50 of contigs 132 No. of contigs >1 kbp 1481 (100.0%) No. of contigs >10 kbp 1417 (95.7%) No. of contigs >100 kbp 493 (33.3%) CG content 36.22% Illumina read coverage 1203 PacBio read coverage 313 No. of protein-coding genes 28 298 Protein-coding gene average size 3457 bp Protein-coding gene median size 2627 bp No. of tRNAs 4387 No. of long noncoding intergenic RNAs 6124 Table 1. Cucurbita argyrosperma ssp. argyrosperma Genome Assembly Statistics. 508 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma values of 100% at every node. The dated phylogeny with hishest posteior densities (HPS ± 95% confidence interval) obtained using mcmctree (Yang, 2007) supports a divergence time between C. argyrosperma and its sister species C. moschata of around 3.98 ± 1.7 Mya, while the divergence between Cucurbita and Benincaseae (C. sativus + C. melo + C. lanatus + L. siceraria; Schaefer et al., 2009) happened around 32.9 ± 11 Mya (Figure 1), concordant with the expected age of the whole- genome duplication event in Cucurbita, approximately 30 ± 4 Mya (Montero-Pau et al., 2017). The crown node of the included Cucurbitaceae species was dated at 44.1 ± 14 Mya. We performed likelihood-ratio tests to compare the likelihood score of a global gene birth–death rate parameter (l) across the tree against multiple l values throughout the phylogeny. A model with a change in l within Cucurbita had a significantly higher log-likelihood (193 746.504) than a single l (198 328.178) throughout the tree (Figure 1 and Supplemental Figure 2). After accounting for genome assembly and annotation error rates, we found that the gene birth–death rate was twice as high in Cucurbita (l = 0.01397) than in the rest of the phylogeny (l = 0.00659). We detected phylogenetic inconsistencies in gene content within Cucurbita, with some genomes containing 32 000 protein- coding genes and others containing 28 000 genes (Supplemental Table 2). To account for possible errors in gene prediction, we repeated the gene family analysis using only high-quality protein-coding gene predictions with annotation edit distances (eAED) lower than 0.5 (Yandell and Ence, 2012; Campbell et al., 2014) to eliminate low-quality gene models. We found a similar number of high-quality gene models in all Cucur- bita genomes, which is much closer to the total number of pre- dicted genes in C. pepo and C. argyrosperma (Supplemental Table 2). Even after discarding low-quality gene models, l was still twice as high inCucurbita (0.01188) than in the rest of the phy- logeny (0.00566) (Supplemental Figures 3 and 4). We found significantly rapid changes in gene family sizes (p < 0.01) throughout most of the branches within the phylogeny (Figure 1). We found that the branch leading to the crown node of the Cucurbita genus and the terminal branch of C. argyrosperma had unusually high rates of gene family evolution (Figure 1). Even though just a small number of gene families showed significantly rapid (p < 0.01) levels of change in the branch leading to the crown node of Cucurbita (six gene families), this branch had the second highest number of gene family changes within the entire phylogeny (Figure 1). Surprisingly, the terminal branch of M. charantia showed the highest number of gene family changes in the whole phylogeny (Figure 1), although there were only a few gene families with significantly rapid changes (27 gene families). Furthermore, the proportion of gene families that either expanded or contracted in the terminal branches of Cucurbita was higher compared with the proportion of gene families that either expanded or contracted in the terminal branches of Benincaseae (Figure 1). The terminal branch of C. argyrosperma had an unusually high number of gene families with significantly rapid expansions/ contractions (327 families). However, most of the rapidly evolving gene families within C. argyrosperma underwent contrac- tions (78.3%), rather than expansions (21.7%). After performing Figure 1. Dated Phylogeny of the Cucurbitaceae Family with Protein-Coding Gene Family Expansions and Contractions per Branch. The phylogeny was generated with 369 single-copy orthologous genes. Fossil evidence was used to calibrate the basal node of the tree (green hexagon). The pie charts and the percentages at every branch of the tree indicate whether a gene family expanded (red), contracted (blue), or remained the same size (gray). The yellow stars indicate the estimated ages of the whole-genome duplication events in theCucurbita genus (Montero-Pau et al., 2017) and in Juglans regia (Luo et al., 2015). The black arrows indicate the change from a basal gene birth/death rate (l) in the most recent common ancestor of the phylogeny to a faster gene birth/death rate after the whole-genome duplication in Cucurbita. Every node in the phylogeny has an approximate likelihood- ratio test (aLRT) support value of 100%. Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 509 The Genome of Cucurbita argyrosperma Molecular Plant a GO enrichment analysis, we found four overrepresented biolog- ical functions associated with the significantly expanded families in C. argyrosperma (Table 2), including microtubule-based move- ment in families mainly composed of proteins with kinesin motor domains, and drug transmembrane transport in families mainly composed of villin/gelsolin proteins. We also found 13 overrepre- sented biological functions associated with the significantly con- tracted families in C. argyrosperma (Table 2), including cell-wall modification in families mainly composed of pectinesterases, response to oxidative stress, oxidation–reduction processes, recognition of pollen, exocytosis, and several processes associ- ated with transmembrane transport. Curiously, drug transmem- brane transport was enriched in both significantly expanded fam- ilies and significantly contracted families, which were composed mainly of multi-antimicrobial extrusion proteins. lincRNA Prediction and Analysis We used Evolinc-I (Nelson et al., 2017) to predict lincRNAs within the genome assembly of C. argyrosperma, as well as the genomes of C. maxima, C. moschata, C. pepo, C. melo, C. sativus, C. lanatus, and L. siceraria. The predicted lincRNAs were compared against the protein-coding gene transcripts of each genome to determine the percentage of lincRNAs produced from the neofunctionalization of duplicated protein-coding se- quences. We also compared the predicted lincRNAs against the RepBase (Bao et al., 2015) sequences from eudicots to determine the percentage of lincRNAs produced from the neofunctionalization of TEs within each genome. Since most of the species transcriptomes used to predict lincRNAs had differences in the organs sequenced, as well as differences in sequencing depth and library construction (see Supplemental Table 3), each species had a different number of predicted lincRNAs (Supplemental Table 2), and these numbers could not be directly compared. However, we expect that the proportion of lincRNAs derived from protein-coding genes and TEs relative to the total number of predicted lincRNAs in a genome remains relatively constant, despite differences in the RNA-seq strategy. Hence,wecompared this proportion betweenCucurbita species and the other species within Cucurbitaceae. Despite the differences in the number of predicted lincRNAs, the percentage of protein-coding-derived lincRNAs was roughly similar between Cucurbita species, while there was more variance in this proportion between the other cucurbits (Figure 2). We found a higher percentage of protein-coding-derived lincRNAs in Cucurbita species than in other cucurbits (Figure 2; p = 0.041), which fits the coding-to-noncoding neofunctionalization hypothe- sis (Kapusta et al., 2013). However, the proportion of TE-derived lincRNAs was lower in the Cucurbita genus compared with the other taxa of the Cucurbitaceae family (Figure 2; p = 0.013). We analyzed the evolution of lincRNA families across the Cucurbitaceae family by using the C. argyrosperma predicted lincRNAs as queries to search for homologs within all the analyzed genomes. We compared the lincRNA homologs of each lincRNA family against the protein-coding genes and the Evolinc-I predicted lincRNAs for each species to assess the relationship between lincRNAs and protein-coding genes, as well as to assess the transcriptional potential of the lincRNA ho- mologs. Since the evolutionary proximity of C. argyrosperma to the other Cucurbita species can lead to erroneous inferences about lincRNA family expansion in this genus, we also used the predicted lincRNAs of C. lanatus as sequence queries in the GO ID GO term FDR p value Significantly expanded protein-coding gene families GO:0007018 Microtubule-based movement <1.729E27 GO:0006270 DNA replication initiation 7.8E16 GO:0006855 Drug transmembrane transport 4.6E05 GO:0007010 Cytoskeleton organization 0.00346 Significantly contracted protein-coding gene families GO:0042545 Cell-wall modification <1.729E27 GO:0006979 Response to oxidative stress <1.729E27 GO:0055114 Oxidation–reduction process <1.729E27 GO:0009733 Response to auxin <1.729E27 GO:0030244 Cellulose biosynthetic process 2.2E22 GO:0006508 Proteolysis 1.2E17 GO:0006887 Exocytosis 3.9E06 GO:0006855 Drug transmembrane transport 4.7E05 GO:0003333 Amino acid transmembrane transport 0.00048 GO:0005992 Trehalose biosynthetic process 0.00064 GO:0048544 Recognition of pollen 0.00064 GO:0005975 Carbohydrate metabolic process 0.00453 GO:0071577 Zinc II ion transmembrane transport 0.00939 Table 2. Enriched Biological Functions of Rapidly Evolving Protein-Coding Gene Families in C. argyrosperma. 510 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma lincRNA family analysis to assess whether the patterns observed within Cucurbita were determined by the effect of lincRNA turn- over throughout the phylogeny (Ulitsky, 2016). We retrieved 5466 C. argyrosperma lincRNA families, of which 67.7% showed evidence of protein-coding gene neofunctionali- zation throughout their phylogenies, while only 32.3% were exclusively composed of noncoding elements. In contrast to the C. argyrosperma lincRNA gene families, we found that 34.3% of the 5231 C. lanatus lincRNA families had evidence of protein-coding neofunctionalization, while 65.7% were exclu- sively composed of noncoding elements. To assess the level of lincRNA conservation across the Cucurbitaceae family, we only used the subset of lincRNAs whose families were solely composed of noncoding elements, since protein-coding-derived lincRNAs can be traced to homologous protein-coding genes in distantly related taxa, therefore leading to overestimation of the conservation of lincRNAs throughout the phylogeny. This anal- ysis showed that the conservation of lincRNAs between C. argyr- osperma and the rest of the analyzed species steadily declines with phylogenetic distance, with an average of 2.27% of lincRNA homologs lost per million years (Figure 3A). 75.9%–92.4% of the lincRNAs found in C. argyrosperma had at least one homologwithin the genomes of the otherCucurbita spe- cies, with an average lincRNA loss rate of about 2.4% per million years. Just 4.3%–8.8% of the lincRNA families had homologs within thegenomesof speciesbelonging toBenincaseae,whereas 6.5% of the lincRNA families had homologs within the genome of M. charantia, which is a higher percentage than that observed in somespeciesbelonging toBenincaseae. Just 1.2%of the lincRNA families in J. regia and 0.3% of those in in F. vesca were retained, and only five lincRNA homologs were conserved in both species. Despite the general trend between phylogenetic distance and lincRNA loss, the average rate of lincRNA loss between C. argyr- osperma and Benincaseae increased to 2.8% per million years and then declined to around 2.1% per million years inM. charan- tia and to 0.96% per million years in the outgroup species. The decline in lincRNA conservation with respect to phylogenetic dis- tance could also be observed for C. lanatus lincRNAs, with an average loss rate of 2.54% per million years in the whole phylogeny (Figure 3B), although the percentage of retained lincRNAs in J. regia and F. vesca was higher (2.3% and 0.7% respectively). The average rate of lincRNA loss between C. lanatus and Cucurbita was also around 2.8% per million years. However, the lincRNA loss rate within Benincaseae was approximately 3.4% per million years, the highest rate observed in the study. The loss rate also declined in M. Charantia to around 2% per million years and to 0.95% per million years in the outgroup species. SomeC. argyrosperma lincRNA families showed a high degree of conservation within Cucurbitaceae, that is, every species had at least one representative gene within the family (1016 families). While most of these conserved lincRNA families had at least one protein-coding gene within its phylogeny (95.17%), a small percentage of the families showing a high degree of conservation were composed of putatively noncoding elements (4.82%). We found that seven of these putatively noncoding families were pre- sent as single-copy orthologs within Cucurbitaceae, and the members of these families were mostly predicted independently as lincRNAs with Evolinc-I, that is, using transcriptional evidence (Figure 4A). The five lincRNAs that were shared between C. argyrosperma and both outgroup species showed complex evolutionary histories, with several instances of duplications and losses, and none of them were conserved in all the analyzed species (e.g., Supplemental Figure 5). Our analyses show a high rate of lincRNA family birth within the Cucurbita genus (Figure 4B). 55.94% of the lincRNA families were exclusively found in the Cucurbita genus. Of these gene families, 78.87% were present in all four Cucurbita species and just 1.7% were exclusive to C. argyrosperma. We found a similar pattern for the C. lanatus lincRNA families, where 64.61% were exclusively found in Benincaseae. However, 47.24% of these lincRNA families were exclusive to C. lanatus, and only 10.47% were present in all four species. Many of the C. argyrosperma lincRNA families (61.2%) showed signals of gene duplication within the Cucurbita genus. Many of these families showed symmetric expansion within Cucur- bita (1948 families), that is, expansion where the number of lincRNA genes remained constant within Cucurbita but at least two-fold higher with respect to the rest of the Cucurbitaceae species (Figure 4C). This pattern is not a product of the phylogenetic distance between Cucurbita species, as it could also be seen within the Cucurbita clade when analyzing the C. lanatus lincRNA families (53 families, Supplemental Figure 6). Most of the lincRNA families with symmetric expansion within Cucurbita also showed evidence of protein- coding neofunctionalization (62.26%). The phylogenies of some of these families showed a duplication event corre- sponding to the whole-genome duplication in Cucurbita, where a protein-coding family gave rise to a lincRNA clade Figure 2. Differences in the Proportion of lincRNAs Associated with Protein-Coding Transcripts (Red) and Transposable Elements (Blue) between Four Cucurbita Genomes (C. argyrosperma, C. moschata, C. maxima, and C. pepo) and Five Genomes of Other Cucurbit Species (C. sativus, C. melo, C. lanatus, L. siceraria, and M. charantia). Cucurbita spp. show a higher proportion of protein-coding gene-derived lincRNAs compared with other members of the same family (p = 0.041), whereas the proportion of transposable element (TE)-derived lincRNAs is lower in Cucurbita compared with the other cucurbit species (p = 0.013). Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 511 The Genome of Cucurbita argyrosperma Molecular Plant (Figure 4D). After carefully inspecting one of these lincRNA families (Carg_TCONS _00015392), we found high levels of sequence conservation between the Cucurbita lincRNAs, although no conservation of the length, the ORF, or the codon structure of its protein-coding homolog was observed (Figure 5A). We also found a thermodynamically stable secondary structure in the lincRNA (Figure 5B and 5D), whereas the homologous protein-coding transcript showed lower structure stability, as well as signs of many equally sta- ble structures throughout the transcript (Supplemental Figure 7), despite both the lincRNA and protein-coding tran- script having similar dinucleotide frequencies (Supplemental Table 4). The structural stability of this lincRNA is even higher than that of one of the highly conserved lincRNAs found in single copy throughout the Cucurbitaceae (Carg_TCONS_00063022; Figure 5C and 5E). DISCUSSION The protein-coding gene content within the Cucurbita species has remained relatively constant, despite the whole-genome duplication that happened around 30 ± 4 Mya (Montero-Pau et al., 2017). However, our results indicate that this genome duplication event had a profound effect on the gene evolutionary dynamics within Cucurbita, namely, higher rates of protein-coding gene family evolution and higher rates of B A Figure 3. lincRNA Conservation across the Cucurbitaceae Family. Each column alongside the phylogeny represents the percentage of homologs found within each cucurbit genome for the subset of lincRNAs without homology to the predicted protein-coding genes in the genomes of (A) Cucurbita argyrosperma and (B) Citrullus lanatus. The percentage of conserved lincRNAs between species declines drastically as the phylogenetic distance becomes larger, due to a high rate of lincRNA turnover (gene birth/death rate). Only a small fraction of lincRNAs are conserved between Cucurbitaceae and the outgroup species. The average rate of lincRNA loss per million years (right) is shown for the following clades: Cucurbita (red), Benincaseae (green), Momordica charantia (purple), and the outgroup species (gray). 512 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma coding-to-noncoding gene neofunctionalization. This idea is further supported by the concordance between the estimated age of the whole-genome duplication and the divergence time between Cucurbita and Benincaseae, as observed in our dated phylogeny. The divergence times estimated in this study mostly fall within the 95highest posterior density of previous phyloge- netic studies, although our estimated ages within the Cucurbita genus are slightly older (Schaefer et al., 2009; Castellanos- Morales et al., 2018). The whole-genome duplication within Cucurbita seems to be responsible for the observed acceleration in the rate of protein- coding gene family evolution, as almost half of the gene families experienced either expansions or contractions in the branch lead- ing to the crownnodeof theCucurbitagenus,witha higher propor- tion of expansions than contractions. Furthermore, the terminal branches in the Cucurbita genus also showed larger proportions of gene family expansions and contractions compared with most of the other cucurbit taxa. Finally, the rate of gene family birth/ death was two times higher in the Cucurbita clade compared with the rest of the phylogeny. All this evidence points toward a higher rate of gene family evolution in Cucurbita after the whole- genome duplication event that happened around 30 Mya (Montero-Pau et al., 2017). These patterns were observed despite performing statistical corrections for genome assembly and annotation errors during our gene family analyses, and were also observed after filtering low-quality gene models, suggesting that our results are robust to possible errors in genemodel predic- tions. Furthermore, the number of gene models obtained after filtering low-quality gene models suggests that the real number of protein-coding genes in Cucurbita may be closer to 28 000 (Montero-Pau et al., 2017) than 32 000 genes (Sun et al., 2017). The terminal branch of J. regia showed a high proportion of gene family expansion. These gene families have been previously shown to be involved in the biosynthesis of nonstructural poly- phenols (Martı́nez-Garcı́a et al., 2016). The genome of J. regia also shows evidence of a whole-genome duplication that happened around 60 Mya (Luo et al., 2015; Martı́nez-Garcı́a et al., 2016), further suggesting that whole-genome duplications are correlated with higher rates of gene family evolution. The results of GO enrichment analysis of the rapidly evolving fam- ilies suggest that several biological functions changed inC. argyr- osperma in relation to other Cucurbita species. For instance, we found a significant contraction in gene families associated with the recognition of pollen as well as pectinesterases, which are involved in pollen tube growth during pollination (Tian et al., 2006). Both kinesin motor protein and villin/gelsolin protein families, which expanded in C. argyrosperma, are also involved in pollen tube growth during pollination (Su et al., 2007; Li et al., 2012). Contraction and expansion of these families could be A B DC Figure 4. Patterns of lincRNA Family Evolution throughout the Cucurbita Genus. (A) Presence of single-copy orthologous lincRNAs with a high degree of conservation throughout the Cucurbitaceae family suggests these lincRNAs have an essential biological function. (B) Sudden duplication bursts (red diamonds) within theCucurbita genus, as well as multiple gene losses (dotted branches), reveal a high rate of lincRNA turnover. (C) Duplication of lincRNA families associated with the whole-genome duplication in Cucurbita (red diamond). (D)Neofunctionalization of protein-coding genes (blue bar) into the novel Carg_TCONS_00015392 lincRNAs (red bar) after the whole-genome duplication in theCucurbita genus (red diamond). Some lincRNAs were independently predicted based on homology using Evolinc-II (triangles in terminal nodes) and based on transcriptomic evidence using Evolinc-I (circles in terminal nodes), further supporting the transcription of these genes. The colors at the terminal nodes indicate the species to which each gene belongs. Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 513 The Genome of Cucurbita argyrosperma Molecular Plant associated with changes in reproductive isolation, since reproductive barriers that prevent hybridization are more stringent in C. argyrosperma than in its sister species, C. moschata (Hurd et al., 1971). Pectinesterases are also involved in cell-wall modification during fruit ripening, and changes in the pectinesterase family could explain the differences in the smoothness of the fruit flesh between C. argyrosperma and C. moschata. The reduction in the number of pectinesterases within C. argyrosperma could have had an impact in its domestication, since fewer genes were available for artificial selection to act upon, possibly restricting its use to seed consumption, unlike the rest of the domesticated Cucurbita species whose ripened fruit flesh is commonly consumed (Lira et al., 2016). A future comparison between the genomes of C. A B D C E Figure 5. Manual Assessment of Primary and Secondary Structure in Some lincRNAs Found within the Genome of Cucurbita argyrosperma. (A)Part of amultiple sequence alignment between a protein-coding gene-derived lincRNA (Carg_TCONS_00015392; down) and the homologous protein- coding gene transcripts (trafficking protein particle complex subunit 11; up). Each codon in the alignment is shown in a different color. Carg_T- CONS_00015392 lincRNAs do not contain an open reading frame or show codon conservation, but orthologous lincRNA sequences are conserved between species. (B–C)Minimum free energy (MFE) structural predictions of (B) the protein-coding-derived lincRNA Carg_TCONS_00015392 and (C) the highly conserved lincRNA Carg_TCONS_00063022. (D–E) RNA base-pairing probability matrices showing the MFE structural prediction (below the diagonal) and all possible suboptimal pairings (above the diagonal) for (D) Carg_TCONS_00015392 and (E) Carg_TCONS_00063022. Higher probabilities are represented as larger dots within the matrices. 514 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma argyrosperma ssp. argyrosperma and C. argyrosperma ssp. sororia will reveal whether this reduction in gene family size happened before or after the domestication of this species. Several contracted families were functionally enriched in exocy- tosis and transmembrane transport functions, which are usually involved in the release of secondary metabolites, hormones, and numerous other compounds (Hedrich and Marten, 2006). Interestingly, different multi-antimicrobial extrusion protein families either expanded or contracted within the genome of C. argyrosperma. This suggests an adaptive transition between different families of multi-antimicrobial extrusion proteins, perhaps related to changes in the geographic distribution of C. argyrosperma from its ancestors (Castellanos-Morales et al., 2018). Since multi-antimicrobial proteins are usually involved in the removal of cytotoxic compounds (Eckardt, 2001), the levels of these compounds could change alongside the geographic distribution of the species, acting as selective pressures in these gene families. Some of the observed evolutionary dynamics within the lincRNA families in Cucurbitaceae can be explained under a high-turnover model of lincRNA evolution, such as the decline in lincRNA con- servation as a function of phylogenetic distance and sudden duplication bursts. These dynamics were observed in both the C. argyrosperma and C. lanatus lincRNA families, suggesting that gene duplication is a common mechanism of lincRNA birth within the Cucurbitaceae family. The average rate of lincRNA loss observed in the Cucurbitaceae family is similar to that observed in the Brassicaceae family, around 2.47% per million years (Nelson et al., 2016). However, considerable variation can be observed within each clade in both phylogenies. In the case of Cucurbitaceae, the loss rate ranged from 2.8% per million years between Cucurbita and Benincaseae to 2.1% per million years between Cucurbita and M. charantia and 3.4% per million years within Benincaseae. The differences in loss rate between Brassicaceae species are even more drastic, ranging from 4.3% per million years between Arabidopsis and Capsella to 2% per million years between Arabidopsis and Brassica (Nelson et al., 2016). Even though the total number of shared lincRNAs decreases with phylogenetic distance, the loss rate seems to decrease between distantly related taxa. In the case of Brassicaceae, the loss rate declined to 1.5% per million years between Arabidopsis and Cleomaceae, which diverged 64 Mya (Nelson et al., 2016). In the case of Cucurbitaceae, we found that the loss rate declined to 0.96% per million years between Cucurbitaceae and the outgroup species. This decline could be explained by a survivor bias, whereby the most biologically important genes are conserved throughout distantly related taxa, thus slowing the rate of lincRNA loss per million years. The rate of lincRNA loss within Tetrapoda seems to be slower than in plants, as 3% of the lincRNAs in humans are also present in chickens, which diverged 300 Mya (Necsulea et al., 2014). The loss rate between Cucurbita and Benincaseae was higher than the rate within Cucurbita, as expected by the effect of the whole-genome dupli- cation (Nelson and Shippen, 2015). However, the high loss rate within Benincaseae was unexpected, since it was higher than that observed between Cucurbita and Benincaseae. We propose that the acceleration in the rate of lincRNA turnover within Benincaseae was caused by the multiple changes in karyotype number throughout Benincaseae (Huang et al., 2009; Levi et al., 2011; Garcia-Mas et al., 2012; Wu et al., 2017), which are considered genomic disturbances that can accelerate this rate (Nelson and Shippen, 2015). This hypothesis is supported by the larger proportion of conserved lincRNAs between C. argyrosperma and M. charantia than between C. argyrosperma and Cucumis, which can be explained by additional genomic disturbances within the Benincaseae family. The decline in lincRNA conservation throughout the Cucurbita- ceae phylogeny could be explained by high levels of gene birth and death, making the search for homology between distantly related species futile, as the vast majority of these genes either arose before the divergence of both taxa or became extinct in one of the lineages (Ulitsky, 2016). It is also possible that lincRNAs have high rates of nucleotide substitution due to positive selection (Smith and Mattick, 2017), which hinders the search for homologous sequences as species become more distantly related. Given that lncRNAs are involved in epigenetic regulation through several mechanisms (Mercer et al., 2009), the possibility of positive selection acting on such dynamic genes may be an important factor in adaptive radiation (Smith and Mattick, 2017). The observation of highly conserved lincRNAs, as well as the ev- idence of their transcription suggests they have an important biological function within the Cucurbitaceae family. However, pinpointing the specific biological functions of lincRNAs is still difficult without experimental data. The high rate of turnover in lincRNA families limits the inference of biological functions from distantly related plants such as Arabidopsis thaliana, in which experimental validation of gene functions is more common (Ulitsky, 2016). Future experimental studies should focus on the functional characterization of these lincRNAs. The proportion of Evolinc-II lincRNA families with evidence of protein-coding neofunctionalization was higher than initially ex- pected based on our direct comparison between lincRNAs and protein-coding genes. Such events are not exclusive of theCucur- bita crown node, as they are rather common throughout theCucur- bitaceae family, although they are particularly frequent within the Cucurbita clade. This suggests that the neofunctionalization of protein-coding genes into novel lincRNAs is more common than initially suspected (Kapusta et al., 2013), and may be a recurrent source of noncoding genes in the Cucurbitaceae family (Chen and Rajewsky, 2007) alongside other sources of lincRNA genes, such as neofunctionalization from TEs (Kapusta et al., 2013). The proportion of lincRNA families with evidence of protein- coding neofunctionalization predicted from comparisons with the C. argyrosperma lincRNAs was higher compared with those predicted from comparisons with the C. lanatus lincRNAs. Furthermore, the proportion of protein-coding gene-derived lincRNAs calculated from the direct comparison between coding transcripts and lincRNAs was significantly higher in Cucurbita compared with the proportion in other cucurbit species, whereas theproportion of TE-derived lincRNAswas lower inCucurbitawith respect to the rest of theCucurbitaceae family. These results sug- gest that the whole-genome duplication in Cucurbita acted as a genomic disturbance that altered lincRNA family birth dynamics, Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 515 The Genome of Cucurbita argyrosperma Molecular Plant withmore lincRNAsbeingderived fromprotein-codinggenes than from TEs (Kapusta et al., 2013; Nelson and Shippen, 2015). This is consistent with the apparent conservation of TE proportions between Cucurbita species, suggesting that TEs did not play an important role in lincRNA family evolution within Cucurbita after the whole-genome duplication, unlike in other taxa such as verte- brate species, where TEs play an important role in lincRNA birth (Kapusta et al., 2013). Our results also differ from those obtained in cotton, where a larger proportion of lincRNAs were homologous to TEs than to protein-coding loci (Zhao et al., 2018). This suggests that the evolutionary dynamics of lincRNAs can change drastically between different plant taxa. After the whole-genome duplication within Cucurbita, many duplicated protein-coding genes that were functionally redundant were co- opted into novel lincRNAs (Ponting et al., 2009; Kapusta et al., 2013). The observed differences in length between Carg_TCONS_ 00015392 and its protein-coding homologs, as well as the disrup- tion of the ORF and the lack of codon structure, suggest this lincRNA is not a cryptic protein-coding transcript but a true non- coding element in Cucurbita. Furthermore, the level of sequence conservation between species, as well as the thermodynamic stability of its predicted secondary structure, suggests that this lincRNA may be functional (Smith and Mattick, 2017). Even though secondary structure alone is insufficient evidence to support the hypothesis that a noncoding RNA is functional, especially considering that some lincRNAs have more than one functional structure (Smith and Mattick, 2017), secondary structures in functional noncoding RNAs are expected to be more stable than those in other sequences with similar compositions (Clote et al., 2005). This can be observed when comparing the stability between Carg_TCONS_00015392 and its protein-coding homolog, both of which have similar dinucleo- tide frequencies but different structural stabilities. This structural stability is also present in the highly conserved lincRNA Carg_T- CONS_00063022. The stability of the predicted secondary struc- tures in both lincRNAs suggests that they are functional (Clote et al., 2005), unlike the secondary structure in the transcript of the protein-coding gene homologous to Carg_TCONS_ 00015392, which appears to be random. Whether all predicted lincRNAs behave similarly or whether they are indeed functional remains to be experimentally validated. We propose that the whole-genome duplication within theCucur- bita genus allowed for faster rates of gene family evolution, since functional redundancy within the genome facilitated the co- option of complex genetic elements, such as previously existing genes, into new functions. During the fractionation process after the whole-genome duplication (Sun et al., 2017), a substantial fraction of the duplicated protein-coding genes with redundant functions either diverged (acquiring novel functions as protein- coding genes, thereby increasing the rate of gene family birth/ death), or neofunctionalized into noncoding elements, such as lincRNAs. METHODS Biological Samples and DNA/RNA Extraction We obtained seeds from a cultivated individual of C. argyrosperma ssp. argyrosperma collected in the region of Tepec, Jalisco (see Supplemental Data 1 for detailed methods and data on fruit selection). One of the seeds was germinated in a greenhouse, and plants were grown to maturity, when flower buds started to develop. We selected one of the germinated plants and extracted total DNA from fresh leaves (Doyle and Doyle, 1987) for genome sequencing. For transcriptome sequencing, we extracted total RNA from leaves, stems, roots, male flower buds, and tendrils using the RNeasy Plant Mini Kit (Qiagen) according to the manufacturer’s protocol. Each RNA sample was precipitated in salty ethanol (260 mM lithium chloride and 66% EtOH). The plant used for whole-genome sequencing and transcriptome sequencing was deposited in the National Herbarium of Mexico (MEXU) under accession number SMH-JMG-627. The details of DNA and RNA sequencing are available in Supplemental Methods. Genome and Transcriptome Assembly The chloroplast genome of C. argyrosperma was assembled with NOVOPlasty (Dierckxsens et al., 2016), and the mitochondrion genome was assembled using the Organelle-PBA pipeline (Soorni et al., 2017). Both organelles were scaffolded using SSPACE-longread (Boetzer and Pirovano, 2014). Gap-filling and base corrections were performed with Pi- lon (Walker et al., 2014). The nuclear genomewas assembled with a hybrid approach, using Platanus (Kajitani et al., 2014) and DBG2OLC (Ye et al., 2016). We used Minimap and Racon (Vaser et al., 2017) to obtain a consensus sequence assembly, then base corrections were made using Pilon. Scaffolding was done using BESST (Sahlin et al., 2014) and SSPACE-longread. Gap closing was performed with GapFiller (Boetzer and Pirovano, 2012) and a final base correction was done with Pilon. See Supplemental Methods for a detailed description of the organelle and nuclear genome assemblies and a description of the transcriptome assembly. The Illumina and PacBio sequence reads were mapped against the genome using BWA mem (Li, 2013) and BlasR (Chaisson and Tesler, 2012), respectively, to assess the completeness of the genome assembly. We mapped the transcriptome reads against the nuclear and organelle genomes using Hisat2 (Kim et al., 2015) to assess assembly completeness. The percentage of reads that mapped to the assembly was calculated using flagstat within SAMtools (Li et al., 2009). Prediction of Transposable Elements and Protein-Coding Gene Models We used the REPET package (Flutre et al., 2011) to predict de novo the TEs within the C. argyrosperma genome assembly, generating a library of non-redundant consensus sequences. These consensus sequences were classified according to Wicker’s classification system (Wicker et al., 2007) using PASTEC (Hoede et al., 2014) within the REPET pipeline (repeat library available in Supplemental Data 2). The repeat library was used to annotate and mask the TEs within the genome assembly with RepeatMasker (Smit et al., 2013). MAKER3 (Cantarel et al., 2008) was used to predict protein-coding gene models in the C. argyrosperma genome assembly. We incorpo- rated AUGUSTUS (Stanke et al., 2006) GeneMark-ES (Lomsadze et al., 2005) and SNAP (Korf, 2004) as ab initio gene predictors within MAKER3. We also used EvidenceModeler (Haas et al., 2008) to obtain additional gene models within MAKER3. We incorporated tRNAscan-SE (Lowe and Eddy, 1997) within the MAKER3 pipeline to predict tRNA genes. See Supplemental Methods for a detailed description of the prediction of protein-coding gene models. The protein-coding genes predicted within C. argyrosperma were function- ally annotated with InterProScan (Jones et al., 2014). The annotation table is available in Supplemental Data 4. 516 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma Phylogenetic and Protein-Coding Gene Family Analyses The details of the phylogenetic analyses can be found within Supplemental Methods. We performed an all-VS-all BLASTp (Camacho et al., 2009) analysis to identify protein-coding gene families with MCL (Enright et al., 2002), using an inflation parameter of 3. All gene families were aligned with MAFFT (Katoh et al., 2002). We generated the species phylogeny with PhyML (Guindon et al., 2010), using SMS (Lefort et al., 2017) to determine the best amino acid substitution model for our sequence alignment. To obtain a dated phylogeny, we used a Bayesian Markov chain Monte Carlo approach with approximate likelihood calculation, as implemented in mcmctree (Yang, 2007). The mcmctree trace files are available in Supplemental Data 3. We assessed changes in protein-coding gene family sizes across the dated phylogeny using CAFE v4.0.2 (De Bie et al., 2006). We initially estimated the gene birth–death parameter l to assess gene family evolu- tion using a subset of gene clusters that had <100 differences in gene con- tent between any pair of species and used it to calculate significant changes (p value <0.01) in gene family size at every branch in the dated phylogeny for every gene family. We compared three different l schemes: (a) a change in l within Cucurbi- taceae, (b) a change in l within Cucurbita, and (c) two changes in l, one within Cucurbitaceae and another within Cucurbita. After finding the best scheme of l parameters within the phylogeny, we estimated error models for genome assembly and annotation errors (Han et al., 2013), and used those models to analyze significant gene family expansions and contractions throughout the tree. We performed the clustering, molecular clock, and gene family analyses using the same methodology as described above using a subset of high-quality protein-coding gene models obtained after filteringMAKER predictions with eAED values lower than 0.5 (input files, CAFE scripts, and final outputs are available in Supplemental Data 5). We performed GO enrichment analyses with topGO using the weight01 method (Alexa et al., 2006). Significantly enriched terms were assessed after performing Fisher’s exact tests and performing false discovery rate (FDR) adjustments of the p values (Benjamini and Hochberg, 1995). Prediction and Analysis of Long Noncoding RNAs We used Evolinc-I (Nelson et al., 2017) to predict lincRNAs within the genome-guided transcriptome assembly of C. argyrosperma. In brief, Evolinc-I predicted as lincRNAs all transcripts longer than 200 bp that did not overlap with any of the predicted protein-coding genes within the genome and did not contain an ORF longer than 100 amino acids (Nelson et al., 2017). lincRNAs were also predicted for the genomes of C. maxima, C. moschata, C. pepo, C. melo, C. sativus, C. lanatus, and L. siceraria using the same methodology as described above, using RNA-seq data available in the Sequence Read Archive (SRA, https:// www.ncbi.nlm.nih.gov/sra). SRA accessions used for each species are available in Supplemental Table 3. The lincRNAs of M. charantia were extracted from the gff3 file available under the NCBI RefSeq genome accession PRJNA433137 (Urasaki et al., 2017). We used BLASTn (Camacho et al., 2009) with a cutoff of 50% coverage and 30% identity to define similarity due to sequence homology between lincRNAs, protein-coding transcripts, and TEs. We performed Student’s t-tests with the stats package in R (R Core Team, 2016) to assess statistical differences in the proportion of protein-coding gene-derived lincRNAs and TE-derived lincRNAs between Cucurbita and the other cucurbit species. We used Evolinc-II (Nelson et al., 2017) to assess the evolution of lincRNA families across the Cucurbitaceae family. Homologous sequences for each query lincRNA were filtered using an e-value of <1e20 after a reiterative reciprocal BLAST search against each genome. We used MAFFT (Katoh et al., 2002) to align each lincRNA family and RAxML (Stamatakis, 2014) to generate gene family trees. Each gene family tree was compared against the species tree to detect duplication or loss of lincRNAs across the phylogeny using Notung (Chen et al., 2000). Since there is a possibility that the predicted lincRNAs are protein-coding genes that could not be predicted with MAKER3, we wanted to exclude as many false positives as possible to obtain a conservative estimation of the num- ber of lincRNA families that evolved from protein-coding genes. Thus, we eliminated 648 possible spurious lincRNAs (that is, protein-coding genes that were mistakenly predicted as lincRNAs) where the only noncoding el- ements within the gene family belonged to the query species, and the other elements were protein-coding genes predicted from the other spe- cies. Likewise, we defined lincRNA families with evidence of protein- coding gene neofunctionalization as those with noncoding elements from two ormore different species within the phylogeny, as the rate of par- allel misidentification of lincRNA genes in several species should be low. Finally, we defined putatively noncoding families as those that lacked any protein-coding gene within the phylogeny, that is, they were composed exclusively of lincRNA genes. All lincRNA family alignments and phylog- enies are available in Supplemental Data 6 and 7. Rates of lincRNA loss were calculated as the percentage of lincRNAs in the query species without a homolog in another species divided by the mean divergence time between the two taxa. The secondary structure predictions and the base-pairing probability matrix of the lincRNAs were calculated with RNAfold from the ViennaRNA package (Hofacker et al., 1994; Lorenz et al., 2011). ACCESSION NUMBERS The raw sequence reads of all the genomic and transcriptomic data are available in the NCBI SRA database (accession SRP157098). The nuclear and organelle genome assemblies; as well as the protein-coding gene, tRNA, and lincRNA predictions of C. argyrosperma are available in the CoGe database (IDs 53608, 52005, 52006) and in the Figshare database (https://doi.org/10.6084/m9.figshare.7728608.v1). The predicted lincR- NAs of the other species are available in the CoGe database (IDs 52078–52084). The dated species tree and the single-copy ortholog align- ment used for the phylogenetic analyses are available in TreeBase (sub- mission ID S23151). SUPPLEMENTAL INFORMATION Supplemental Information is available at Molecular Plant Online. FUNDING This study was funded by Comisión Natural para el Conocimiento y Uso de la Biodiversidad (CONABIO) KE004 ‘‘Diversidad genética de las espe- cies de Cucurbita en México e hibridación entre plantas genéticamente modificadas y especies silvestres de Cucurbita,’’ CONABIO PE001 ‘‘Di- versidad genética de las especies de Cucurbita en México. Fase II. Ge- nómica evolutiva y de poblaciones, recursos genéticos y domesticación’’ (both awarded to R.L.-S. and L.E.E.), Consejo Nacional de Ciencia y Tec- nologı́a (CONACyT) Investigación Cientı́fica Básica 2011.167826 ‘‘Ge- nómica de poblaciones: estudios en el maı́z silvestre, el teosinte (Zea mays ssp. parviglumis y Zea mays ssp. mexicana)’’ (awarded to L.E.E.), and CONACyT Problemas Nacionales through grant number 247730 (awarded to D.P.). J.B.-R. is a doctoral student from Programa de Doctor- ado en Ciencias Biomédicas, Universidad Nacional Autónoma de México (UNAM) and received fellowship 583146 from CONACyT. The sabbatical leave of L.E.E. at Dr. Peter Tiffin’s laboratory, Department of Plant and Mi- crobial Biology, University of Minnesota, was supported by the program PASPA-DGAPA, UNAM. The Evolinc-II analyses were carried out using CONABIO’s computing cluster, which was partially funded by Secretarı́a de Medio Ambiente y Recursos Naturales (SEMARNAT) through the grant ‘‘Contribución de la Biodiversidad para el Cambio Climático’’ to CONABIO. Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 517 The Genome of Cucurbita argyrosperma Molecular Plant AUTHOR CONTRIBUTIONS J.B.-R., E.I.-L., A.V.-L., S.M.-H., R.L.-S., and L.E.E. jointly conceived and designed the study. G.S.d.l.V. and S.M.-H. collected the biological sam- ples and carried out morphological analyses to select the parental individ- ual of C. argyrosperma used for whole-genome sequencing and assem- bly. J.B.-R., G.S.d.l.V., and A.V.-L. carried out the molecular laboratory work. J.B.-R. performed the genome assembly and gene predictions, and carried outmost of the analyses. E.I.-L. trained the AUGUSTUSmodel for gene prediction and ran REPET for TE prediction. Y.T.G.-G. did the functional annotation of the protein-coding gene families. E.I.-L. and Y.T.G.-G. helped in several bioinformatics analyses. J.B.-R. drafted the manuscript. D.P., R.L.-S., and L.E.E. obtained the funding for the study. All authors revised the final version of the manuscript. ACKNOWLEDGMENTS This manuscript is presented in partial fulfillment of the requirements to obtain a PhD degree by J.B.-R. in the Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma deMéxico (UNAM).We acknowledge the Doctorado en Ciencias Biomédicas for the support provided during the development of this project. Special thanks to Valeria Souza for support- ing this research, and the technical support of Laura Espinosa-Asuar and Erika Aguirre-Planter. We thank Xitlali Aguirre-Dugua for depositing the plant accession in the herbarium, and all the Laboratorio de EvoluciónMo- lecular y Experimental at Instituto de Ecologı́a, UNAM. We also acknowl- edge the support of the Facultad de Estudios Superiores Iztacala, UNAM, and of the Instituto Nacional de Investigaciones Forestales, Agrı́colas y Pecuarias, Campo Experimental Bajı́o. The authors thank Rodrigo Garcı́a Herrera, head of the Scientific Computing Department at LANCIS, Insti- tuto de Ecologı́a, UNAM, for running the HTC infrastructure we used for part of the analyses. The authors acknowledge the technical support of MSc Emanuel Villafán and the resources for high-performance computing that the Institute of Ecology (INECOL) made available for conducting the genome assembly and annotation reported in this paper. L.E.E. acknowl- edges the support of the program PASPA-DGAPA, UNAM to conduct his sabbatical year. No conflict of interest declared. Received: August 27, 2018 Revised: December 12, 2018 Accepted: December 28, 2018 Published: January 7, 2019 REFERENCES Alexa, A., Rahnenf€uhrer, J., and Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22:1600–1607. Alverson, A.J., Wei, X., Rice, D.W., Stern, D.B., Barry, K., and Palmer, J.D. (2010). Insights into the evolution of mitochondrial genome size from complete sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol. Biol. Evol. 27:1436–1448. Ansimova, M., and Gascuel, O. (2006). Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Syst. Biol. 55:539–552. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29. Bao, W., Kojima, K.K., and Kohany, O. (2015). Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6:11. Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57:289–300. Boetzer, M., and Pirovano, W. (2012). Toward almost closed genomes with GapFiller. Genome Biol. 13:R56. Boetzer, M., and Pirovano, W. (2014). SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics 15:211. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST plus: architecture and applications. BMC Bioinformatics 10:421. Campbell, M.S., Holt, C., Moore, B., and Yandell, M. (2014). Genome annotation and curation using MAKER and MAKER-P. Curr. Protoc. Bioinform. 48:4.11.1–4.11.39. Cantarel, B.L., Korf, I., Robb, S.M.C., Parra, G., Ross, E., Moore, B., Holt, C., Alvarado, A.S., and Yandell, M. (2008). MAKER: an easy- to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18:188–196. Castellanos-Morales, G., Paredes-Torres, L.M., Gámez, N., Hernández-Rosales, H.S., Sánchez-de la Vega, G., Barrera- Redondo, J., Aguirre-Planter, E., Vázquez-Lobo, A., Montes- Hernández, S., Lira-Saade, R., et al. (2018). Historical biogeography and phylogeny of Cucurbita: insights from ancestral area reconstruction and niche evolution. Mol. Phylogenet. Evol. 128:38–54. Chaisson, M.J., and Tesler, G. (2012). Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13:238. Chekanova, J.A. (2015). Long non-coding RNAs and their functions in plants. Curr. Opin. Plant Biol. 27:207–216. Chen, K., and Rajewsky, N. (2007). The evolution of gene regulation by transcription factors and microRNAs. Nat. Rev. Genet. 8:93–103. Chen, K., Durand, D., and Farach-Colton, M. (2000). NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7:429–447. Chikhi, R., andMedvedev, P. (2014). Informed and automated k-mer size selection for genome assembly. Bioinformatics 30:31–37. Clote, P., Ferré, F., Kranakis, E., and Krizanc, D. (2005). Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 11:578–591. Daniell, H., Lin, C.-S., Yu, M., and Chang, W.-J. (2016). Chloroplast genomes: diversity, evolution, and applications in genetic engineering. Genome Biol. 17:134. De Bie, T., Cristianini, N., Demuth, J.P., and Hahn, M.W. (2006). CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22:1269–1271. Dierckxsens, N., Mardulyn, P., and Smits, G. (2016). NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45:gkw955. Doyle, J.J., and Doyle, J.L. (1987). A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem. Bull. 19:11–15. Edger, P.P., VanBuren, R., Colle, M., Poorten, T.J., Wai, C.M., Niederhuth, C.E., Alger, E.I., Ou, S., Acharya, C.B., Wang, J., et al. (2018). Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. Gigascience 7:1–7. Eckardt, N.A. (2001). Move it on out with MATEs. Plant Cell 13:1477– 1480. Enright, A.J., Van Dongen, S., and Ouzounis, C.A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30:1575–1584. Fatica, A., and Bozzoni, I. (2014). Long non-coding RNAs: new players in cell differentiation and development. Nat. Rev. Genet. 15:7–21. 518 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma Flutre, T., Duprat, E., Feuillet, C., and Quesneville, H. (2011). Considering transposable element diversification in de novo annotation approaches. PLoS One 6:e16526. Ganfornina, M.D., and Sánchez, D. (1999). Generation of evolutionary novelty by functional shift. BioEssays 21:432–439. Garcia-Mas, J., Benjak, A., Sanseverino, W., Bourgeois, M., Mir, G., Gonzalez, V.M., Henaff, E., Camara, F., Cozzuto, L., Lowy, E., et al. (2012). The genome of melon (Cucumis melo L.). Proc. Natl. Acad. Sci. U S A 109:11872–11877. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al. (2011). Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29:644–652. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307–321. Guo, S., Zhang, J., Sun, H., Salse, J., Lucas,W.J., Zhang, H., Zheng, Y., Mao, L., Ren, Y., Wang, Z., et al. (2012). The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat. Genet. 45:51–58. Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., and Wortman, J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9:R7. Han,M.V., Thomas, G.W.C., Lugo-Martinez, J., andHahn,M.W. (2013). Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol. 30:1987–1997. Hedrich, R., and Marten, I. (2006). 30-year progress of membrane transport in plants. Planta 224:725–739. Hoede, C., Arnoux, S., Moisset, M., Chaumier, T., Inizan, O., Jamilloux, V., and Quesneville, H. (2014). PASTEC: an automatic transposable element classification tool. PLoS One 9:1–6. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, S., Tacker, M., and Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. F. Chem. 125:167–188. Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W.J., Wang, X., Xie, B., Ni, P., et al. (2009). The genome of the cucumber, Cucumis sativus L. Nat. Genet. 41:1275–1281. Hurd, P.D., Jr., Linsley, E.G., and Whitaker, T.W. (1971). Squash and gourd bees (Peponapis, Xenoglossa) and the origin of the cultivated Cucurbita. Evolution 25:218–234. Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30:1236–1240. Kajitani, R., Toshimoto, K., Noguchi, H., Toyoda, A., Ogura, Y., Okuno, M., Yabana, M., Harada, M., Nagayasu, E., Maruyama, H., et al. (2014). Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24:1384– 1395. Kapusta, A., Kronenberg, Z., Lynch, V.J., Zhuo, X., Ramsay, L., Bourque, G., Yandell, M., and Feschotte, C. (2013). Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 9:e1003470. Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066. Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12:357–360. Korf, I. (2004). Gene finding in novel genomes. BMC Bioinformatics 5:59. Lefort, V., Longueville, J.-E., and Gascuel, O. (2017). SMS: smart model selection in PhyML. Mol. Biol. Evol. 34:2422–2424. Levi, A., Hernandez, A., Thimmapuram, J., Donthu, R., Wright, C., Ali, C., Wechter, W.P., Reddy, U., and Mikel, M. (2011). Sequencing the genome of the heirloom watermelon cultivar charleston gray. Plant and Animal Genome Conference. P047. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078–2079. Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv 1303.3997. Li, J., Xu, Y., and Chong, K. (2012). The novel functions of kinesin motor proteins in plants. Protoplasma 249:S95–S100. Lira, R., Eguiarte, L., Montes, S., Zizumbo-Villarreal, D., Colunga- Garcı́aMarı́n, P., and Quesada, M. (2016). Homo sapiens-Cucurbita interaction in Mesoamerica: domestication, dissemination and diversification. In Ethnobotany of Mexico, R. Lira, A. Casas, and J. Blancas, eds. (New York: Springer-Verlag), pp. 389–402. Liu, G., Mattick, J.S., and Taft, R.J. (2013). A meta-analysis of the genomic and transcriptomic composition of complex life. Cell Cycle 12:2061–2072. Liu, X., Hao, L., Li, D., Zhu, L., and Hu, S. (2015). Long non-coding RNAs and their biological roles in plants. Genomics Proteomics Bioinformatics 13:137–147. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O., and Borodovsky, M. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33:6494–6506. Lorenz, R., Bernhart, S.H., H€oner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P.F., and Hofacker, I.L. (2011). ViennaRNA package 2.0. Algorithms Mol. Biol. 6:26. Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955–964. Luo, M.-C., You, F.M., Li, P., Wang, J.-R., Zhu, T., Dandekar, A.M., Leslie, C.A., Aradhya, M., McGuire, P.E., and Dvorak, J. (2015). Synteny analysis in Rosids with a walnut physical map reveals slow genome evolution in long-lived woody perennials. BMC Genomics 16:707. Magadum, S., Banerjee, U., Murugan, P., Gangapur, D., and Ravikesavan, R. (2013). Gene duplication as a major force in evolution. J. Genet. 92:155–161. Martı́nez-Garcı́a, P.J., Crepeau, M.W., Puiu, D., Gonzalez-Ibeas, D., Whalen, J., Stevens, K.A., Paul, R., Butterfield, T.S., Britton, M.T., Reagan, R.L., et al. (2016). The walnut (Juglans regia) genome sequence reveals diversity in genes coding for the biosynthesis of non-structural polyphenols. Plant J. 87:507–532. Mercer, T.R., Dinger, M.E., and Mattick, J.S. (2009). Long non-coding RNAs: insights into functions. Nat. Rev. Genet. 10:155–159. Montero-Pau, J., Blanca, J., Bombarely, A., Ziarsolo, P., Esteras, C., Martı́-Gómez, C., Ferriol, M., Gómez, P., Jamilena, M., Mueller, L., et al. (2017). De novo assembly of the zucchini genome reveals a whole genome duplication associated with the origin of the Cucurbita genus. Plant Biotechnol. J. 12:3218–3221. Necsulea, A., Soumillon, M., Warnefors, M., Liechti, A., Daish, T., Zeller, U., Baker, J.C., Gr€utzner, F., and Kaessmann, H. (2014). The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505:635–640. Nee, M. (1990). The domestication of Cucurbita (Cucurbitaceae). Econ. Bot. 44:56–68. Molecular Plant 12, 506–520, April 2019 ª The Author 2019. 519 The Genome of Cucurbita argyrosperma Molecular Plant Nelson, A.D.L., and Shippen, D.E. (2015). Evolution of TERT-interacting lncRNAs: expanding the regulatory landscape of telomerase. Front. Genet. 6:1–6. Nelson, A.D.L., Devisetty, U.K., Palos, K., Haug-Baltzell, A.K., Lyons, E., and Beilstein, M.A. (2017). Evolinc: a tool for the identification and evolutionary comparison of long intergenic non-coding RNAs. Front. Genet. 8:1–12. Nelson, A.D.L., Forsythe, E.S., Devisetty, U.K., Clausen, D.S., Haug- Batzell, A.K., Meldrum, A.M., Frank, M.R., Lyons, E., and Beilstein, M.A. (2016). A genomic analysis of factors driving lincRNA diversification: lessons from plants. G3 (Bethesda) 6:2881–2891. Paris, H.S. (2016). Genetic resources of pumpkins and squash,Cucurbita spp. In Genetics andGenomics of Cucurbitaceae, R. Grumet, N. Katzir, and J. Garcia-Mas, eds. (Cham, Switzerland: Springer), pp. 111–154. Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.-C., Mendell, J.T., and Salzberg, S.L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33:290–295. Ponting, C.P., Oliver, P.L., and Reik, W. (2009). Evolution and functions of long noncoding RNAs. Cell 136:629–641. R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/. Sahlin, K., Vezzi, F., Nystedt, B., Lundeberg, J., and Arvestad, L. (2014). BESST—efficient scaffolding of large fragmented assemblies. BMC Bioinformatics 15:281. Schaefer, H., Heibl, C., and Renner, S.S. (2009). Gourds afloat: a dated phylogeny reveals an Asian origin of the gourd family (Cucurbitaceae) and numerous oversea dispersal events. Proc. R. Soc. B Biol. Sci. 276:843–851. Simão, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212. Smit, A., Hubley, R., andGreen, P. (2013). RepeatMasker.Open4.0. http:// www.repeatmasker.org. Smith, M.A., and Mattick, J.S. (2017). Structural and functional annotation of long noncoding RNAs. In Bioinformatics. Methods in Molecular Biology, J.M. Keith, ed. (New York: Humana Press), pp. 65–85. Soorni, A., Haak, D., Zaitlin, D., and Bombarely, A. (2017). Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data. BMC Genomics 18:49. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313. Stanke, M., Sch€offmann, O., Morgenstern, B., and Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMCBioinformatics 7:62. Su, H., Wang, T., Dong, H., and Ren, H. (2007). The villin/gelsolin/fragmin superfamily proteins in plants. J. Integr. Plant Biol. 49:1183–1191. Sun, H., Wu, S., Zhang, G., Jiao, C., Guo, S., Ren, Y., Zhang, J., Zhang, H., Gong, G., Jia, Z., et al. (2017). Karyotype stability and unbiased fractionation in the Paleo-Allotetraploid Cucurbita genomes. Mol. Plant 10:1293–1306. The Gene Ontology Consorsium. (2017). Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 45:D331–D338. Tian, G.W., Chen, M.H., Zaltsman, A., and Citovsky, V. (2006). Pollen- specific pectin methylesterase involved in pollen tube growth. Dev. Biol. 294:83–91. Ulitsky, I. (2016). Evolution to the rescue: using comparative genomics to understand long non-coding RNAs. Nat. Rev. Genet. 17:601–614. Urasaki, N., Takagi, H., Natsume, S., Uemura, A., Taniai, N.,Miyagi, N., Fukushima,M., Suzuki, S., Tarora, K., Tamaki, M., et al. (2017). Draft genome sequence of bitter gourd (Momordica charantia), a vegetable and medicinal plant in tropical and subtropical regions. DNA Res. 24:51–58. Vaser, R., Sovic, I., Nagarajan, N., and Sikic, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27:737–746. Walker, B.J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., Cuomo, C.A., Zeng, Q., Wortman, J., Young, S.K., et al. (2014). Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9:e112963. Wang, J., Chu, S., Zhu, Y., Cheng, H., and Yu, D. (2015). Positive selection drives neofunctionalization of the UbiA prenyltransferase gene family. Plant Mol. Biol. 87:383–394. Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J.L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., et al. (2007). A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8:973–982. Wu, S., Shamimuzzaman, M., Sun, H., Salse, J., Sui, X., Wilder, A., Wu, Z., Levi, A., Xu, Y., Ling, K.-S., et al. (2017). The bottle gourd genome provides insights into Cucurbitaceae evolution and facilitates mapping of a Papaya ring-spot virus resistance locus. Plant J. 92:963–975. Yandell, M., and Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13:329–342. Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586–1591. Ye, C., Hill, C.M., Wu, S., Ruan, J., and Ma, Z.S. (2016). DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci. Rep. 6:31900. Zhao, T., Tao, X., Feng, S., Wang, L., Hong, H., Ma,W., Shang, G., Guo, S., He, Y., Zhou, B., et al. (2018). LncRNAs in polyploid cotton interspecific hybrids are derived from transposon neofunctionalization. Genome Biol. 19:195. Zheng, Y.H., Alverson, A.J., Wang, Q.F., and Palmer, J.D. (2013). Chloroplast phylogeny of Cucurbita: evolution of the domesticated and wild species. J. Syst. Evol. 51:326–334. 520 Molecular Plant 12, 506–520, April 2019 ª The Author 2019. Molecular Plant The Genome of Cucurbita argyrosperma 58 CAPÍTULO 3: ANÁLISIS GENÓMICO DE LA DOMESTICACIÓN DE Cucurbita argyrosperma Artículo de investigación: The domestication patterns of Cucurbita argyrosperma are consistent with early migration events in Mesoamerica and general domestication syndromes in squashes. Este capítulo trata sobre el estudio de la domesticación de Cucurbita argyrosperma utilizando genómica comparada y genómica de poblaciones. Para ello, se secuenció y ensamblo el genoma de la subespecie silvestre C. argyrosperma subsp. sororia con la misma metodología utilizada en el capítulo 1 para el genoma de C. argyrosperma subsp. argyrosperma. Ambos genomas se ensamblaron a nivel de cromosoma y se compararon, encontrando múltiples variantes estructurales asociadas al proceso de domesticación de esta especie. Se colectaron individuos de poblaciones silvestres y domesticadas de C. argyrosperma a lo largo de su distribución en México y se secuenciaron con la técnica de GBS para realizar análisis de demografía histórica y pruebas que permitieran detectar señales de selección. Los resultados sugieren que C. argyrosperma comenzó a ser domesticada en Jalisco hace aproximadamente 13,800 años, poco después de la llegada de los humanos a Mesoamérica. Dicho evento está asociado a una reducción en el tamaño efectivo de las poblaciones domesticadas, aunque el flujo genético con las poblaciones silvestres parece haber mitigado los efectos del cuello de botella. Las pruebas de selección detectaron genes candidatos asociados a características de la subespecie domesticada, tales como la pérdida de tricomas urticantes, el aumento en el tamaño de la planta, la pérdida de latencia en semillas, el aumento en el tamaño de las semillas y la concentración de carotenoides en los frutos. Finalmente, se encontraron señales de selección asociadas a un pseudogen producido por la duplicación completa del genoma en Cucurbita. Este capítulo fue sometido para su publicación y se encuentra en proceso de revisión. 1 The domestication patterns of Cucurbita argyrosperma are consistent with early migration events in Mesoamerica and general domestication syndromes in squashes. Josué Barrera-Redondo°, Guillermo Sánchez-de la Vega°, Jonás A. Aguirre-Liguori, Gabriela Castellanos-Morales, Xitlali Aguirre-Dugua, Salvador Montes-Hernández, Yocelyn T. Gutiérrez-Guerrero, Erika Aguirre-Planter, Maud Tenaillon, Rafael Lira-Saadde*, Luis E. Eguiarte* ° J.B-R. and G.S-V. contributed equally to this work. * To whom correspondence may be addressed. Email: fruns@unam.mx and rlira@unam.mx Abstract Cucurbita argyrosperma is an important Mexican crop whose consumption has been focused to its seeds. It was an important crop for early Mesoamerican cultures and is a useful model to understand Cucurbita domestication, as seeds were probably the first trait to be selected in domesticated Cucurbita. We investigated the domestication history of this species through genome-wide analyses of genetic variants between the wild and domesticated populations ranging its known distribution in Mexico. We sequenced 192 individuals using tGBS libraries and assembled a novel wild reference genome to a chromosome level. Our results indicate that the domesticated subspecies of C. argyrosperma descends from an ancient wild population in Jalisco ˜13,800 years ago under constant gene flow. The demographic patterns of the domesticated populations coincide with the archaeological records and the migration patterns of humans throughout Mesoamerica. We detected several selective sweeps associated with the domestication of C. argyrosperma. We found candidate genes involved in the synthesis and regulation of growth hormones, plant defense mechanisms and phospholipid transportation, possibly associated with its domestication. We found selective signals on several uncharacterized proteins and noncoding transcripts, some of which arose during the whole-genome duplication that happened in the origin of the genus Cucurbita. Significance Statement Cucurbita argyrosperma was an important Mesoamerican crop that is still harvested today in Mexico. Its domestication process helps us understand the early years of plant domestication in America and the importance of this species as an ancient nutritional source of lipids. The analysis of population-level genomic data reveals that the domestication of C. argyrosperma happened soon after humans arrived at Mesoamerica, 13,800 years ago. The domestication process of this species took place in Jalisco, likely alongside maize. We also found signals of selective pressures within the genome of C. argyrosperma that can be attributed to this domestication process. These selective signals are associated with changes that could likely influence fruit size, seed size, seed dormancy and lipid content in seeds. Introduction Domestication is an evolutionary process where an organism (usually humans) selects, modifies and eventually assumes control over the reproduction of another organism, leading in many cases to a mutualistic relationship where the humans can exploit a particular resource of interest while the domesticated organism increases its fitness and its geographical range (1; 2). This is particularly true for the species of the Cucurbita genus 2 (pumpkins, squashes, and gourds), whose domestication process led to their ecological success after its natural dispersers went extinct during the Pleistocene (3). This domestication process has led to the morphological, physiological and molecular differentiation between the domesticated Cucurbita and their wild relatives (4; 5), the latter who now have geographically restricted distributions throughout the American continent (6). Today, the domesticated Cucurbita taxa are successful crops that are consumed worldwide, having a global annual production of around 24 million tons with an estimated value of four billion dollars (7). The Cucurbita genus is composed of ca. 14 species which have experienced at least six independent domestication events (8; 6). Despite being independent domestications, most of the Cucurbita crops share several traits, including the loss of bitter compounds known as cucurbitacins, the loss of defense mechanisms, loss of seed dormancy, gigantism, larger fruits, larger seeds and a high diversity of fruit morphology (7; 5). However, each Cucurbita crop also experienced unique selective regimes. Some good examples are the zucchini variety of Cucurbita pepo whose fruit is consumed at an early developmental stage, while the fruit flesh of C. moschata is consumed at a mature stage in most regions of Mexico but also as an immature fruit in Yucatan (4; 7). These selective regimes were predominantly defined by the nutritional and cultural necessities of early human populations in Mesoamerica (9; 10). The initial steps of Cucurbita crop domestication were probably directed towards seed consumption (11; 12) since they are rich in both carbohydrates and fatty acids (13), and cucurbitacins can be removed through boiling and washing, as is still seen in the consumption of wild Cucurbita seeds in southern Mexico (14). While maize (Zea mays) acted as a main source of carbohydrates and the common bean (Phaseolus vulgaris) was well suited as a source of proteins, Cucurbita seeds were a combined source of carbohydrates and lipids for prehispanic civilizations (10; 13). In this sense, the Mexican crop Cucurbita argyrosperma is a good model to study the early steps of Cucurbita domestication, since its cultivation is directed mainly towards seed production, whereas the fruit flesh is rarely consumed, due to its poor quality compared to other Cucurbita crops, aside from some local specialized varieties (4). C. argyrosperma is composed of two subspecies corresponding to the domesticated taxon (Cucurbita argyrosperma subsp. argyrosperma) and its wild relative (Cucurbita argyrosperma subsp. sororia) (4). Both subspecies are sympatrically distributed throughout the Pacific Coast of Mexico and Central America, with a few populations scattered in the coast of the Gulf of Mexico (4; 15). The domesticated taxon is also distributed in the Yucatan Peninsula, where its wild counterpart is absent (15). The domestication of Cucurbita argyrosperma is presumed to have happened around the Jalisco-Balsas Basin, as hinted by archaeological and genetic evidence (8; 16; 15). The earliest archaeological record of C. argyrosperma was found in the Central Balsas Valley (Guerrero) alongside maize with an estimated age of 8,700 years (16), while the oldest archaeological record of a Cucurbita crop is that of C. pepo, which was found in Guil´a Naquitz (Oaxaca) with an estimated age of 10,000 years (17). Since crop domestication in Mesoamerica is linked to migration patterns and cultural development of early human populations in America (18; 10), it is expected that Mesoamerican crop species share historical demographical patterns with humans. We studied different populations of C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia using genome-wide data alongside a novel genome assembly of C. argyrosperma subsp. sororia and the previously reported reference genome of C. argyrosperma subsp. argyrosperma (19), both of which were assembled to a chromosome level. We investigated the demographic history of the wild and domesticated populations to infer their evolutionary relationships and propose a domestication scenario. We also performed scans to detect selection throughout the genome of C. argyrosperma to detect candidate regions associated with the domestication of this species and the possible link between the candidate genes and the domestication syndromes observed in C. argyrosperma subsp. argyrosperma. 3 Results Genome assembly of C. argyrosperma subsp. sororia We sequenced and assembled the genome of the wild subspecies C. argyrosperma subsp. sororia from an individual located in Puerto Escondido (Oaxaca), to find structural differences when compared to the genome of C. argyrosperma subsp. argyrosperma (19). We assembled the nuclear genome in 828 contigs with an N50 contig size of 1.3 Mbp and an L50 of 58 contigs (Table S1). A BUSCO analysis (20) against the embryophyta odb9 database revealed 92.8% of complete BUSCOs, 1.2% fragmented BUSCOs and 6.0% missing BUSCOs within the genome assembly, similarly to other Cucurbita genome assemblies (21; 22; 19). We predicted 31,452 protein- coding genes within the genome assembly using BRAKER2 (23; 24). This reference genome is 9.23% larger than the genome of C. argyrosperma subsp. argyrosperma (19). Anchoring the reference genomes into pseudomolecules We anchored 99.97% of the C. argyrosperma subsp. argyrosperma genome assembly (19) and 98.8% of the C. argyrosperma subsp. sororia genome assembly into 20 pseudomolecules using RaGOO (25), which corresponds to the haploid chromosome number of Cucurbita genomes (26). Both chromosome assemblies of C. argyrosperma show high synteny conservation across the Cucurbita genus (Fig. S1), revealing a previously reported inversion in chromosome four that is shared with Cucurbita moschata but not with the other Cucurbita species (22). Most of the C. argyrosperma chromosomes also show chromosome-wide homoeologous pairs within the genome assembly (Fig. S1), which have been previously attributed to a whole-genome duplication event in the Cucurbita genus (22; 21). We found several putative rearrangements in the genome of C.argyrosperma subsp. argyrosperma when compared to C. argyrosperma subsp. sororia (Fig. 1) and using C. moschata as an outgroup (Figs. S2 S21). Some of the centromeres in the wild genome were larger than in the domesticated, possibly due to a better assembly of the repetitive regions. We found several copy-number variants (CNVs), inversions, translocations and unalignable regions between the wild and the domesticated genomes (Fig. 1). The wild genome is 9.23% larger than the domesticated one (19), which could be explained by these structural variations. The genes found within the CNV losses in the domesticated genome were enriched in pectinesterases and microtubule-based processes (p-value < 0.05). We also found sucrose-6F-phosphate phosphohydrolase within a CNV loss in the domesticated genome. Even though some regions were unalignable between both genomes, they share some common genes such as microtubule-associated proteins and genes related to tryptophan biosynthesis. However, such unaligned regions contain more genes in the wild genome than in the domesticated, including some proteolytic enzymes and sucrose biosynthetic genes that are not contained in the domesticated genome. Population data and SNP genotyping We collected plant samples across the known distribution of C. argyrosperma in both its domesticated populations (C. argyrosperma subsp. argyrosperma) and wild populations (C. argyrosperma subsp. sororia) throughout Mexico (Table S2). The samples were sequenced using the tunable Genotype by Sequencing (tGBS) method (27) to obtain genome-wide genetic information of the C. argyrosperma populations. The reads were quality-filtered and mapped against the chromosome-level genome assembly of C. argyrosperma ssp. argyrosperma to predict single nucleotide polymorphisms (SNPs) throughout the genome of each individual. We obtained an initial dataset consisting of 12,813 biallelic SNPs with a mean read depth of 50 reads per SNP and a minor allele frequency (MAF) of at least 1% corresponding to 108 individuals of C. argyrosperma subsp. argyrosperma, 44 individuals of C. argyrosperma subsp. sororia, 12 individuals of C. argyrosperma that were 4 previously reported to have a semi-wild phenotype and a cultivated genotype (15) and 5 individuals of C. moschata that were used as outgroups (13k dataset, Dataset S1). Figure 1. Chromosome map representing the matching regions and putative structural variants between the genome assemblies of C. argyrosperma subsp. sororia (wild) and C. argyrosperma subsp. argyrosperma (domesticated). Matching colors represent the aligned homologous regions between both genomes, while white segments represent regions that could not be aligned to the other genome. Inverted regions are highlighted with an asterisk. Demographic history of C. argyrosperma during its domestication For this set of demographic analyses, the 13K dataset was pruned to eliminate SNPs that deviated from Hardy- Weinberg equilibrium (p < 0.01) as well as adjacent SNPs under linkage disequilibrium (r2 > 0.25 in 100 kbp sliding windows) to obtain 2,861 SNPs that could be used to infer the demographic history of C. argyrosperma during its domestication (marker density of 12 SNPs per Mb). We found a slightly higher average genetic variation in C. argyrosperma subsp. sororia compared to C. argyrosperma subsp. argyrosperma (Table 1), suggesting that the possible effects of a domestication bottleneck were alleviated by constant gene flow. At a population scale, the wild population in Jalisco had the highest genetic diversity within C. argyrosperma subsp. sororia, whie the highest diversity in C. argyrosperma subsp. argyrosperma was found in the domesticated populations distributed alongside the Pacific Coast (Table S3). The domesticated and wild populations of C. argyrosperma showed genetic differentiation within the species (FST = 0.0646; 95% confidence interval from 0.0565 to 0.0751), while the feral populations are more closely related to the domesticated populations (FST = 0.0479) than to the wild populations (FST = 0.1006). We used SNPhylo (28) and ADMIXTURE (29) to evaluate the genealogical relationships and genetic structure among the wild and domesticated populations of C. argyrosperma (Fig. 2), finding genetic differentiation between wild and domesticated populations, as previously detected by the FST analyses. The crown node of the 5 domesticated populations was highly supported in the Maximum Likelihood (ML) tree (bootstrap = 83), indicating a single domestication event (Fig. 2A). The wild subspecies showed additional genetic differentiation between the populations in southeastern Mexico and the populations in Jalisco, in both the ADMIXTURE assignations (Fig. 2B) and their positions in the ML tree (Fig. 2A). The wild populations of Jalisco are genetically closer to the domesticated taxon, according to the admixed assignation in the ADMIXTURE test (Fig. 2B) and by their paraphyletic position in the ML tree (Fig. 2A). The domesticated populations in Guerrero and Jalisco appear as the basal branches of the C. argyrosperma subsp. argyrosperma clade (Fig. 2A), all showing instances of genetic similarity to the wild populations in Jalisco in the K = 4 ADMIXTURE assignation (Fig. 2B). Table 1. Average genetic diversity of the wild, domesticated and feral populations of Cucurbita argyrosperma using 2,861 unlinked SNPs (r2 < 0.25, MAF > 1%). (N ind = number of individuals, N pop = number of populations, HO = observed heterozygosity, HE = expected heterozygosity, π = nucleotide diversity, FIS = inbreeding coefficient, Var = variance) Taxon Nind Npop HO (Var) HE (Var) π (Var) FIS (Var) Cucurbita argyrosperma subsp. sororia 44 4 0.1 (0.02) 0.1 (0.01) 0.1 (0.02) 0.01 (0.03) Cucurbita argyrosperma subsp. argyrosperma 109 19 0.09 (0.01) 0.09 (0.01) 0.09 (0.01) 0.03 (0.03) feral populations 14 3 0.1 (0.03) 0.09 (0.02) 0.1 (0.02) -0.02 (0.02) Cucurbita moschata (outgroup) 5 1 0.08 (0.04) 0.06 (0.02) 0.07 (0.03) -0.02 (0.02) The ADMIXTURE results (Fig. 2B) also show admixture events between some wild and domesticated populations, particularly between the populations of Oaxaca and the populations of Sinaloa (Fig. 2C), suggesting gene flow between sympatric populations. Despite the evidence of gene flow in Sinaloa, the feral populations are consistently grouped alongside their sympatric domesticated populations within the domesticated subspecies (Fig. 2A-B), indicating that these populations diverged recently from nearby domesticated populations. The domesticated subspecies show genetic differentiation between the western populations alongside the Pacific Coast in Mexico and the eastern populations distributed around the South Coast of Mexico, the Gulf of Mexico and the Yucatán Peninsula, with a possible recent anthropogenic migration event of eastern populations into Onavas, Sonora (Fig. 2B-C). We used Fastsimcoal 2 (30; 31) to explore whether C. argyrosperma was domesticated from a wild population in southeast Mexico or from a wild population in Jalisco (Fig. 3A) and infer demographic parameters such as effective population sizes, gene flow and the time of the split between the domesticated and wild populations. Given that interbreeding has been previously observed between wild and domesticated Cucurbita taxa (15; 32; 33), we compared three different gene flow models (continuous gene flow, secondary contact or no gene flow) for each scenario (Fig. 3A). The 20 replicates of each model converged to similar likelihoods, indicating that the simulations performed well. Based on the likelihood distribution of the models, the AIC values and the difference between simulated and expected SFS values, we found that the Jalisco domestication scenario with continuous gene flow had the highest likelihood (Tukey’s range test p-value < 0.01; Fig. 3B-C). This model indicates that domestication occurred around 13,829 generations ago with an interval of 12,160 to 22,216 generations within a 1.5x 6 interquartile range (IQR). As this species is strictly annual (4), these generations correspond to years. Our model indicates that in general migration rates were low but occurred between all the genetic groups (ranging between 1E-4 and 2E-4 migrants per generation). However, we observed higher gene flow from the wild Southern populations to the wild populations in Jalisco (m=0.033) and from the wild Southern populations to the domesticated populations (m=0.004). The simulations indicate that C. argyrosperma subsp. argyrosperma had a higher effective population size (N e=1,238) than C. argyrosperma subsp. sororia (N e southern=110, N e Jalisco=105). Figure 2. Genetic structure and phylogenetic relationships between wild and domesticated populations of Cucurbita argyrosperma based on 2,861 SNPs. (A) Maximum Likelihood tree with 100 bootstraps (only bootstrap values > 70 shown) (B) ADMIXTURE analysis using K values ranging from 2 to 4. (C) Geographic distribution of wild (down left) and domesticated (down right) populations, with pie chart colors representing the ADMIXTURE assignation of the individuals in 4K ancestral populations (size of pie charts proportional to sample size). The seed in the maps represent the earliest archaeological record of argyrosperma from Xihuatoxtla, Guerrero (dated 8,700 years BP) (16). 7 Figure 3. Coalescent simulations and most likely domestication scenario of Cucurbita argyrosperma. (A) Six different models were simulated to assess their compatibility with the unfolded multidimensional Site Frequency Spectrum of our data. (B) Comparison of the likelihood of all the models. All models were significantly different based on a Tukey test (p-value < 0.01). (C) Representation of the model that best fits the data, including the estimated parameters of divergence time, migration rates and effective population sizes. Domestication sweeps in C. argyrosperma In order to perform the tests to detect selective sweeps associated with the domestication of C. argyrosperma, we removed from the 13k dataset the C. moschata individuals as well as the feral individuals of C. argyrosperma. We used the same 1% MAF threshold for this subset, obtaining a 10,617 SNP dataset suitable to detect selective sweeps, with a marker density of 44 SNPs per Mb and a considerably fast LD decay throughout the genome (Fig. S22). We performed two FST based tests as implemented in BayeScEnv (34) and PCAdapt (35) to detect selective sweeps between the domesticated and the wild populations of C. argyrosperma (Fig. 4A). 8 We found 334 outlier SNPs by either BayeScEnv or PCAdapt , of which 19 SNPs were predicted as outliers by both tests (Fig. 4A). By assuming a moderate LD within the selective sweeps, we compared the genomic regions 5kp upstream and downstream these 19 candidate SNPs between the wild and the domesticated reference genomes, finding several structural differences that could be attributed to the selective signals (Table S4). We found the arabinogalactan protein AGP18 close to one of these SNPs, which showed two insertions and one deletion in the first exon of the domesticated genome, disrupting its open reading frame (Fig. 4B). We also found several indels between a 17.8 kDa class I heat shock protein (HSP17.8) and an uncharacterized protein, one of whose size was 422 nt long, possibly disrupting an unknown regulatory sequence (Fig. 4C). We found two serine/threonine-protein kinases, PBL10 and PBL23, where the former shows several deletions upstream of the gene (Table S4) and the later shows a 915 nt insertion downstream of the gene, possibly disrupting its transcriptional activity (Fig. 4D). In addition, we found a strong candidate SNP within an intron of a Ribonucleases P/MRP protein subunit (POP1) homolog with a 229 nt insertion upstream of the gene (Fig. 4E). Figure 4. Footprints of selection associated with the domestication of Cucurbita argyrosperma. (A) Manhattan plots representing the BayeScEnv (up) and PCAdapt (down) tests in each chromosome of the genome. The blue line indicates a p-value < 0.05 and the red line indicates a p-value < 0.01 for each of the tests. The 19 SNPs that were considered as outliers by both tests are represented as red dots. (B-E) These 19 outlier SNPs were further analyzed by aligning the domesticated reference genome (green lines) against the wild reference genome (blue line) 5,000 bp in the vicinity of the candidate SNP (see Table S4). Some examples include: (B) the disruption of the AGP18 gene by two insertions and one deletion in the domesticated genome, (C) two large deletions in the domesticated genome between the HSP17.8 gene and an uncharacterized protein, (D) a huge insertion downstream of the PBL23 gene, (E) and a large insertion upstream of a POP1 homolog. 9 We found 125 genes including candidate SNPs by either test within their structure (i.e., introns, exons, UTRs; Table S5). After performing a Gene Ontology enrichment analysis, we found seven significantly enriched biological processes in these 125 candidate genes, including abscisic acid (ABA) biosynthesis and brassinosteroid-mediated signaling (Table S6). Among these candidate genes, we found the homolog of Auxin response factor 1 (ARF1), WRKY transcription factor 2 (WRKY2), ABC transporter C family member 2 and ABC transporter E family member 2. We also found that the zeaxanthin epoxidase activity was significantly enriched in the 125 candidate genes (Table S6), including one of the two copies of zeaxanthin epoxidase within the genome of C. argyrosperma. We detected CLAVATA3/ESR-related protein 10 (CLE10), Tubulin alpha chain (TUBA), Cellulose synthase interactive 1 (CSI1), seven serine/threonine-protein kinases and a phosphatidylinositol/phosphatidylcholine transfer protein (SFH9) among our candidate genes, the last which was also a candidate gene by both BayeScEnv and PCAdapt (Table S4). We looked for the orthologous sequences of the cucurbitacin pathway reported in Cucumis sativus (36) within the C. argyrosperma subsp. argyrosperma genome through a bidirectional BLAST analysis (Table S7). However, we found no evidence of selective pressures on any cucurbitacin-related gene. We found 13 uncharacterized proteins among the 125 candidate genes (Table S5) and three uncharacterized proteins among the convergent selective signals (Table S4). All these uncharacterized proteins were multiexonic and showed Annotation Edit Distances < 0.5 (37), suggesting they are most likely real genes and not annotation artifacts. Eleven of these uncharacterized proteins were found to have homologs among the NR database, revealing their phylogenetic conservation. However, one of these genes (Carg11153) was only found within genomes of the Cucurbitaceae species, finding single orthologs in several cucurbits but in double copy within the Cucurbita genomes (22; 21). Finally, we found an uncharacterized protein (Carg25109) from which we could not retrieve any homolog sequences when comparing it to the NR database, making it a potential de novo gene in C. argyrosperma. We found that eight of the PCAdapt candidate SNPs reside within seven previously described long noncoding RNAs in the genome of C. argyrosperma subsp. argyrosperma (19), as well as two long noncoding RNAs in close proximity to the convergent selective signals (Table S4). Two of these long noncoding RNAs reside within chromosome 3 and are homologous to the RMD5 gene found in chromosome 7 (Fig. S23). Coincidentally, these long noncoding RNAs are found in the same chromosome as another previously described long noncoding RNA with possible signals of purifying selection that is homologous to a trafficking protein particle found in chromosome 7 (19)(Fig. S23). The other five long noncoding RNAs show sequence similarity to either mitochondrial-like or plastid-like genomic sequences but reside within four different chromosomes of the nuclear genome. Most of these transcripts are multi-exonic, suggesting they are not artifacts produced by “transcriptional noise” (38; 39). Furthermore, a BLASTn search of these organelle-like transcripts revealed they were also found in the same nuclear chromosomes of the C. argyrosperma subsp. sororia genome assembly, suggesting they are not a product of genome misassemblies, but rather organellar introgressions into the nuclear genome. Discussion Effects of domestication on genetic diversity The genome assembly of the domesticated subspecies was smaller than that of wild subspecies, which is possibly caused by the loss of structural variants during its domestication. Many of these unaligned regions contain entire genes, making this wild taxon a reservoir of potentially adaptive presence/absence variants. The genetic diversity of the domesticated populations (C. argyrosperma subsp. argyrosperma) is just slightly lower 10 than in the wild populations (C. argyrosperma subsp. sororia), suggesting that the effects of a domestication bottleneck were not as dramatic as in many other crop species (1). The effects of the domestication bottleneck were probably alleviated by the constant genetic flow between both subspecies, as observed in our data and in previous studies (33). This constant gene flow may be related to the sympatric distribution of the wild and domesticated populations of C. argyrosperma throughout the pacific coast of Mexico (15) and by their coevolved pollinators Peponapsis and Xenoglossa in both subspecies (40). In this sense, the phenotype of the feral populations could be attributed to a combination of wild genetic introgression and phenotypic plasticity, rather than fast adaptive changes to the wild conditions. The concept of a drastic domestication bottleneck has been cast into doubt by recent archaeogenetic studies, as this phenomenon may be accentuated latter throughout a long period of time as well as during crop improvement (41). Since C. argyrosperma subsp. argyrosperma is a traditional crop composed mostly of landraces (4), the effects of a genetic bottleneck may not be as harsh as in other crop species. The genetic diversity in C. argyrosperma subsp. argyrosperma is maintained by the traditional agricultural practices and selection criteria performed by Mexican farmers, which promote the conservation of local landrace varieties throughout the country (42). Several studies have shown that traditional agricultural practices are a fundamental force that maintains the diversity of crop species (43; 44). This is also true for Cucurbita species, where traditional agriculture and diverse criteria of human selection promotes the diversity of landraces (45; 46). Furthermore, the wild Cucurbita populations have also been subjected to demographic bottlenecks after the extinction of the Pleistocene megafauna which acted as their natural dispersals (3), previously reducing their genetic diversity in a similar fashion to a domestication bottleneck. This is supported through chloroplast and microsatellite data in Cucurbita pepo, whose wild populations show a lower genetic diversity than their domesticated counterparts due to its small geographic distribution (47). Domestication of C. argyrosperma in Jalisco Our results suggest that the initial population of C. argyrosperma subsp. sororia from which C. argyrosperma subsp. argyrosperma originated was distributed within the state of Jalisco, as revealed by its genetic closeness to C. argyrosperma subsp. argyrosperma. The genetic closeness between the wild populations in Jalisco and the domesticated taxon was also observed with mitochondrial and microsatellite markers (8; 15). Our coalescent simulations also support this scenario, suggesting that C. argyrosperma started its domestication process around 13,800 generations ago, which can be roughly extrapolated to years considering that mesophilic Cucurbita species display an annual life cycle (4). This coalescent event predated the oldest archaeological record of a domesticated C. argyrosperma, around 8,700 years ago (16; 48) and is in rough agreement with the earliest archaeological records of human presence in Mesoamerica, around 13,000 years ago (49), although humans inhabited the continent as early as 18,500 years ago (50). This suggests that C. argyrosperma may have undergone management practices during early human arrival to Mesoamerica (17), leading to a genetic differentiation prior to the fixation of clear domestication traits that were subsequently found on the archaeological records (51; 16; 48). Our data show that the domesticated populations found within Guerrero and Jalisco are the most closely related to the wild population in Jalisco. This means that even though the wild ancestors of C. argyrosperma subsp. argyrosperma came from Jalisco, the domestication process may have occurred throughout the JaliscoBalsas region, bringing these ancestral populations to other states such as Guerrero, where their oldest archaeological remains were found (16). The idea of a domestication center in the Jalisco-Balsas basin has also been previously suggested using mitochondrial and microsatellite markers (8; 15), as well as archaeological data (16). Similarly, ancient human migration events have been documented from Jalisco to the rest of the Balsas Basin between 11 10,000 to 7,000 years ago, probably bringing with them several crop species such as C. argyrosperma, Zea mays, and Phaseolus vulgaris as food sources (16; 18; 52). These migration patterns may explain the genetic cohesiveness between the domesticated populations of C. argyrosperma found throughout the western Pacific Coast in Mexico, representing the first fully domesticated lineages of the species (18). Previous studies based on 8,700 years old phytoliths found in Xihuatoxtla, Guerrero, suggest a parallel domestication of Zea mays and C. argyrosperma in the Balsas-Jalisco region within this time period (16; 48) that may have been disseminated along these human migration events (18). Overall, the genetic patterns of C. argyrosperma structure are coherent with the archaeological evidence of early human migration throughout Mesoamerica (18; 49) (Fig. 5). Figure 5. Model of Cucurbita argyrosperma domestication based on archaeological and genetic data. The colored lines represent the presence of each C. argyrosperma population in Mesoamerica (blue = wild populations, green = domesticated populations in western Mexico, red = domesticated populations in eastern Mexico) Archaeological and paleoclimatic data obtained from (8; 48; 18). The selective sweeps in C. argyrosperma can be traced back to some domestication traits Even though tGBS sequencing has a limited capacity to detect selective sweeps across the genome (53), we found several signals of outlier SNPs between the domesticated and wild populations of C. argyrosperma. The SNP density for our selection tests was of 44 SNPs per Mb, which is one order of magnitude denser than other studies using reduced representation genome sequencing to detect selective sweeps (54). This is a consequence of the relatively small genome size of C. argyrosperma, around 229 Mb (19). Nonetheless, the LD in C. argyrosperma decays at a shorter length than our SNP density, so our genome scans should be interpreted as a partial representation of the selective sweeps associated with the domestication process (53). Fortunately, the use of a wild reference genome helped us investigate the vicinity of the strong candidate SNPs, revealing large insertions or deletions either upstream or downstream of the genes that were close to the region under selection. These structural variants could potentially be linked with the disruption of the transcriptional activity of their nearby genes, as the mutations associated with domestication traits are usually found within cis- regulatory regions, rather than within the gene itself (55). Future studies should include large indels within the selective scans to confirm its role in the appearance of domestication traits. 12 Some SNPs that were retrieved as outlier SNPs in both tests were found near genes involved in biotic and abiotic plant defense responses. For example, HSP17.8 has been proven to activate in response to heat (56), oxidative stress (57), and salt stress (58). Likewise, PBL10 and PBL23 have been suggested to be involved in plant defense pathways due to its similarity to other serine/threonine-protein kinases (59). This is concordant with previous studies showing that induced defense mechanisms are usually selected against during plant domestication, as the products of these responses are usually unpleasant or harmful to humans when the plant is consumed (60). Our Gene Ontology enrichment analysis found several genes enriched in the ABA biosynthesis and brassinosteroidmediated signaling. This suggests that the alteration of growth hormones may play an important role in C. argyrosperma domestication. ABA is involved in a myriad of functions such as the regulation of plant growth, plant development, seed dormancy and response to biotic/abiotic stress (61). In this sense, the lack of dormancy in seeds and gigantism are both common domestication changes (62) that are present in domesticated cucurbits that may be caused by changes in the regulation of ABA and brassinosteroids (63; 5). Particularly, phytohormone regulation may be involved in Cucurbita fruit size alongside microtubule-related genes (5). We found WRKY2 among our candidate genes, which is involved in seed germination and postgermination development in A. thaliana (64) and may explain the lack of seed dormancy in C. argyrosperma subsp. argyrosperma. We also found CLE10 among our candidate loci, which is involved in the regulation of cell fate and proliferation in the apical meristems that is homologous to CLAVATA3 (65). CLAVATA3 and its pathway have been linked to fruit size in tomato (66), as meristem proliferation is correlated to fruit size (67). We also detected two genes involved in microtubule formation, namely TUBA and CSI1 (68; 69), which may play a role in fruit size or other morphological differences between the wild and domesticated subspecies. Among specific candidate genes, we found two ABC transporters under selection in C. argyrosperma. ABC transporters have been known to be involved in the transmembrane transport of ABA-GE, an ABA conjugate that is usually attributed to the plant response against water stress (70). However, studies made in Hordeum vulgare suggest that the transport of ABA-GE may play a role in seed development alongside de novo ABA synthesis within the developing seed (71), suggesting a role of ABC transporters in the seed development of C. argyrosperma. Previous studies have also found an association between variants in ABC transporter proteins and seed size in Cucurbita maxima (72) and Linum usitatissimum (73), further suggesting that ABA may be deregulated via selective pressures on ABC transporters to enhance seed size in C. argyrosperma during its domestication. Additionally, variants in a serine/threonine-protein kinase as the ones we found in our selective scans have also been associated with seed size in Cucurbita maxima (72). We found SFH9 as a candidate gene by both BayeScEnv and PCAdapt, which belongs to a gene family involved in the transportation of phosphatidylinositol and phosphatidylcholine throughout the plant cells (74). Phosphatidylcholine is a major component of the phospholipid content in Cucurbita seeds (75), suggesting phospholipid content could be targeted through artificial selection for its nutritional value (10). We also found a significant enrichment of zeaxanthin epoxidase activity, and a zeaxanthin epoxidase homolog, among our candidate genes. Zeaxanthin epoxidase is linked to the degradation of carotenoids in plant seeds (76), which in turn are important precursors of ABA biosynthesis, regulating processes such as germination and maturation (77). Carotenoid degradation may also be deregulated in fruit tissue, as the domesticated Cucurbita shows a larger concentration of carotenoids in their fruits compared to their wild relatives, giving them their characteristic orange fruit flesh (78). Bitterness is a monogenic trait that differentiates the wild and cultivated subspecies of C. argyrosperma and the rest of the domesticated Cucurbita (51). Even though we expected strong selective pressures against bitterness in C. argyrosperma subsp. argyrosperma, none of the genes associated with the selective sweeps could be traced back to an ortholog of the cucurbitacin pathway reported in Cucumis sativus (36). This may probably be a limitation of our tGBS sequencing data, limiting our ability to detect this selective sweep (54). Thus, future 13 studies should aim to use whole-genome resequencing to obtain a better understanding of the selective pressures in Cucurbita species during domestication. Selective sweeps on novel genomic elements We found 13 uncharacterized genes associated with the domestication sweeps of C. argyrosperma. Most of these proteins are conserved across multiple plant species, suggesting they are important in a broad biological context. However, we also found two uncharacterized genes with a restricted taxonomic presence. De novo genes have been previously shown to be involved in agronomically relevant domestication traits (79), which should be further analyzed directly on the species of interest. Interestingly, we found several long noncoding RNAs under selection, which suggests a few trans-regulatory elements may play a part in C. argyrosperma domestication, as opposed to the usual selection of cis-regulatory elements throughout domesticated plants (55). Selective pressures on long noncoding RNAs have been previously reported on other domesticated crops such as Phaseolus vulgaris (80), indicating the importance of analyzing noncoding transcripts on domestication studies. Two of the long noncoding RNAs with evidence of selective sweeps were found in chromosome 3 and were homologous to the RMD5 gene found in chromosome 7 of C. argyrosperma. Interestingly, chromosomes 3 and 7 share chromosome-wide homoeologous blocks, strongly suggesting that these two overlapping long noncoding RNAs originated from a single pseudogenization process of one of the duplicated copies of RMD5 after the whole-genome duplication event in Cucurbita (22; 21; 19). Despite becoming pseudogenes, these noncoding transcripts show signals of selective sweeps during the domestication process of C. argyrosperma, suggesting they might still be functional. A previously reported long noncoding RNA in Cucurbita argyrosperma (Carg TCONS 00015392) also showed homology to the trafficking protein particle complex subunit 11 (TRAPPC11) and presented both evidence of purifying selection and a thermodynamically stable secondary structure (19). We found that Carg TCONS 00015392 and TRAPPC11 are also found in chromosomes 3 and 7, respectively. This strongly suggests that the fractionation process between chromosomes 3 and 7 led to the pseudogenization of the RMD5 and TRAPPC11 copies in chromosome 3, who are still transcriptionally active. Given the evidence of selection acting on these pseudogenes, they may now play regulatory roles as long noncoding RNAs within the genome of C. argyrosperma (19). The five organelle-like long noncoding transcripts with evidence of selective sweeps reside within different chromosomes of the nuclear genome of C. argyrosperma and most of them are multi-exonic, suggesting they are real noncoding RNAs rather than annotation artifacts. The presence of organellar sequences within the nuclear genome is not surprising, as Cucurbita genomes are known to suffer from multiple duplication and transfer events between the different genomic compartments of the cell, especially within the mitochondrial genome (81; 82). This phenomenon, coupled with the already high rates of nuclear genome evolution in Cucurbita (19), indicates that organellar transfers into the nucleus may be a source of novel regulatory elements in Cucurbita genomes. The function of all these candidate long noncoding RNAs or their role in the domestication of C. argyrosperma remains elusive, but our results indicate that these noncoding transcripts are good candidates for future functional characterization. Concluding remarks This work represents the first study of domestication in a Cucurbita species using genomic data, unraveling its underlying demographic patterns and selective pressures. We consider that the sympatric distribution and constant gene flow between C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, in addition to the maintenance of genetic diversity through traditional agricultural practices, has aided in the conservation of genetic variation in this species. 14 Our analyses support the notion that C. argyrosperma was domesticated in Jalisco and the adjacent Balsas region, as suggested by previous studies (8; 15), alongside other important crop species such as Zea mays (16) and Phaseolus vulgaris (80). The demographic patterns of C. argyrosperma we observe in our genomic data coincide with the earliest presence of humans in Mesoamerica (49) and the emerging patterns of human migration throughout the biological and cultural corridors in Mesoamerica (18). We found several signals of selection associated with domestication traits in C. argyrosperma subsp. argyrosperma, as well as selective pressures on uncharacterized proteins and long noncoding RNAs that should be further studied. The list of candidate loci will be useful for future studies where such genes could be validated, as candidate domestication genes are usually of agronomic value for crop improvement (83) or de novo crop domestication (84; 85). The availability of a wild Cucurbita genome is a helpful resource for future studies that may want to avoid using reference genomes of domesticated taxa in a comparative fashion. It will also serve as a resource to find genetic variation in wild Cucurbita populations that may be useful for improvement programs (86). Methods Genome assembly and annotation of Cucurbita argyrosperma subsp. sororia We sequenced the genome of a C. argyrosperma subsp. sororia individual collected in Oaxaca, Mexico. Total DNA was extracted from leaf tissue and sequenced using PacBio Sequel and Illumina HiSeq4000. We filtered the Illumina sequences using the qualityControl.py script (https://github.com/Czh3/NGSTools) to retain the reads with a PHRED quality ≥ 30 in 85% of the sequence and an average PHRED quality ≥ 25. The Illumina adapters were removed using SeqPrep (https://github.com/jstjohn/SeqPrep) and the paired reads that showed overlap were merged. The chloroplast genome was assembled using NOVOplasty (87) and the organellar reads were filtered using Hisat2 (88) against the chloroplast genome of C. argyrosperma subsp. argyrosperma (19) and the mitochondrial genome of C. pepo (81). We assembled the nuclear genome into small contigs using the Illumina reads and the Platanus assembler (89). The Platanus contigs were assembled into larger contigs using the PacBio Sequel reads and DBG2OLC (90). We performed two iterations of minimap2 and racon (91) to obtain a consensus genome assembly by mapping the PacBio reads and the Platanus contigs against the DBG2OLC backbone. We performed three additional polishing steps using PILON (92) by mapping the Illumina reads against the consensus genome assembly with BWA mem (93). The genome annotation processes were performed using the GenSAS v6.0 online platform (94). The transposable elements within the genome were predicted and masked using RepeatModeler (http://www. repeatmasker.org/RepeatModeler/). We downloaded five RNA-seq libraries of Cucurbita argyrosperma available on the Sequence Read Archive (accessions SRR7685400, SRR7685404 - SRR7685407), preformed the same quality filters described above for the Illumina genomic sequences, and aligned the high-quality reads against the masked genome of C. argyrosperma subsp. sororia using STAR v2.7 (95). We used filterBAM from the Augustus repository (96) to filter low-quality alignments and used the remaining alignments as RNA-seq evidence for BRAKER2 (96; 97; 23; 24). Anchoring the reference genomes into pseudomolecules We obtained the Pacbio corrected reads from the PacBio RSII reads of C. argyrosperma subsp. argyrosperma (NCBI SRA accession SRR7685401) and the PacBio Sequel reads of C. argyrosperma subsp. sororia using CANU (98). We anchored the genome assemblies of Cucurbita argyrosperma subsp. argyrosperma (19) and C. 15 argyrosperma subsp. sororia into pseudomolecules using RaGOO (25) alongside the PacBio corrected reads of each taxon to detect and correct misassemblies, using a gap size of 2600 bp for chromosome padding (average gap length of C. argyrosperma subsp. argyrosperma genome assembly) and using the genome assembly of Cucurbita moschata (22) as reference. We evaluated the synteny and rearrangements between Cucurbita genomes using Synmap2 (99), prommer (100) and Smash (101) using a minimum block size of 100,000 nt, a threshold of 1.9 and a context of 28. Sample collection, DNA extraction and sequencing We collected seeds from 26 populations of C. argyrosperma subsp. argyrosperma (domesticated populations), 15 populations of C. argyrosperma subsp. sororia (wild populations), and three feral populations, covering most of the reported distribution of this species throughout Mexico (6). The seeds were germinated in a greenhouse until the seedlings started to grow leaves. We extracted total DNA from fresh leaves using a DNeasy Plant Kit (Qiagen) of 192 individuals across the collected populations (Table S2), including five individuals of Cucurbita moschata to be used as outgroup, all which were sequenced by Data2Bio LLC using the tGBS method (27) with an Ion Proton instrument and two restriction enzymes (Sau3AI/BfuCI and NspI). Data filtering and SNP genotyping The raw reads were trimmed using LUCY2 (102), removing bases with PHRED quality scores < 15 using overlapping sliding windows of 10bp. Trimmed reads shorter than 30 bp were discarded. The trimmed reads were mapped against the chromosome-level genome assembly of Cucurbita argyrosperma subsp. argyrosperma using segemehl (103), since empirical studies suggest this program outperforms other read-mapping software while using Ion Torrent reads (104). We only retained the reads that mapped uniquely to one site of the reference genome for subsequent analyses. We used BCFtools (105; 106) for an initial variant calling step, retaining variants with at least 6 mapped reads per individual per site where the reads had a minimum PHRED quality score of 20 in the called base and a minimum mapping quality score of 20 (107). We used plink (108) to perform additional filters, such as retaining only biallelic SNPs, retaining SNPs with no more than 50% of missing data, individuals with no more than 50% of missing data and sites with a MAF of at least 1% (13K dataset). In order to obtain an adequate SNP dataset to infer the demographic history of C. argyrosperma, we performed additional filters to the 13k dataset with plink (108), including the elimination of all the SNPs that diverged significantly (p < 0.01) from the Hardy-Weinberg equilibrium exact test (109), and the elimination of adjacent SNPs in the genome with a squared correlation coefficient (R2) larger than 0.25 within 100 kbp sliding windows with a step size of 100 bp. In order to obtain an adequate SNP dataset to detect selective sweeps associated with the domestication of C. argyrosperma, we eliminated all the feral individuals of C. argyrosperma, which could not be assigned to either a wild or a domesticated population, as well as the five individuals of C. moschata. We also eliminated the SNP sites with more than 50% missing data and performed a MAF filter of 1% after reducing the number of individuals in the 13K dataset. The SNP density in this dataset was calculated using VCFtools (110) and the LD decay was calculated using plink (108) with a minimum R2 threshold of 0.001. Population structure We used diveRsity (111) to calculate the pairwise FST statistics, using 100 bootstraps to calculate the 95% confidence intervals. We used STACKS (112) to calculate the genetic variation in the wild, domesticated and feral populations. We used ADMIXTURE (29) to evaluate the genetic structure among the wild and 16 domesticated populations of C. argyrosperma, evaluating the assignation of individuals into one (CV error = 0.26205), two (CV error = 0.25587), three (CV error = 0.25806) and four (CV error = 0.26658) K populations. We used SNPhylo (28) to reconstruct a neighbor-joining maximum-likelihood tree with genetic distances between all the individuals in the dataset. We performed 100 bootstraps to assess the reliability of the tree topology. Coalescent simulations We used coalescent simulations to test two different possible scenarios of divergence between the genetic groups observed in our ADMIXTURE and SNPhylo results to determine the most likely region where domestication occurred. In the first scenario, the wild populations in Jalisco represent the oldest lineage and the wild populations in southern Mexico and the domesticated populations coalesce with it. In the second scenario, the wild group in southern Mexico is the oldest lineage and the wild populations in Jalisco and domesticated populations coalesce with it. We also tested three scenarios of divergence. The first scenario assumes that gene flow occurred continuously during the divergence, the second scenario assumes that divergence occurred without gene flow and there was a posterior secondary contact, where gene flow occurred. The third scenario assumes that divergence occurred without gene flow and divergent populations never came into contact. For each model, we estimated the time of divergence (T) between genetic groups and their effective population size (Ne). For models that included migration between groups, we also estimated the migration rate (m). We used Fastsimcoal 2 (30; 31) to determine the parameters that maximize the composite likelihood of each model given the unfolded multidimensional Site Frequency Spectrum (SFS). The unfolded multidimensional SFS was obtained with DADI (113), using 17 wild individuals of Jalisco, 27 wild individuals of southern populations, 123 domesticated individuals and 5 individuals of C. moschata used as an outgroup to unfold the SFS. We ran 100,000 simulations with 20 replicates for each model (two divergence scenarios and three gene flow scenarios) using the following settings: -M 0.001, -C, -L 40 and –N 200,000. We also selected log-uniform priors for parameter estimations, setting times of divergence between 1000 and 200,000 generations, effective population sizes between 100 and 60,000 individuals and migration rates between 0.0001 and 0.5. We also constringed the times of divergence in all scenarios, forcing the domesticated taxa to diverge after the wild relatives. After corroborating that all replicates converged to similar likelihoods, we combined all replicates and retained all outputs that were above the 95% of the likelihood distribution. We selected the best model based on a Tukey test (p-value < 0.01) between the top 5% distribution of the composite likelihoods of the six models, the Akaike information criteria of the highest likelihood of each model and the differential between the highest observed and estimated composite likelihoods. Tests to detect selective sweeps We used BayeScEnv (34) to detect candidate SNPs that were differentiated between the wild and domesticated populations of C. argyrosperma. For the “environmental” values used by BayeScEnv, we assigned each population as either wild (0) or domesticated (1). The SNPs with q-values < 0.05 were regarded as candidate loci under selection. The Mahalanobis distances implemented in PCAdapt (35) were used to detect candidate SNPs after controlling for the first two principal components in our dataset, which correspond to the subspecies and geographical differentiation observed during the population structure analysis. We performed Bonferroni corrections to adjust the p-values and the SNPs with p-values <0.05 were regarded as candidate loci under selection. We used snpEff (114) to associate the candidate loci found in both tests with the genome annotation of C. argyrosperma subsp. argyrosperma (19). The genes that could be unambiguously assigned to a candidate SNP 17 were screened for a Gene Ontology enrichment analysis using topGO and the weight01 algorithm (115). We determined the significantly enriched biological functions by performing Fisher’s exact test. The convergent SNPs under selection were compared between the C. argyrosperma subsp. argyrosperma and the C. argyrosperma subsp. sororia genome assemblies by extracting 5 kb upstream and downstream of the convergent SNPs under selection and detecting the orthologous region in both assemblies through a BLAST search. The orthologous regions were aligned using MUSCLE (116) and manually inspected using AliView (117). The synteny map between chromosomes 3 and 7 was generated using SynMap2 (99). Data Availability The chromosome-level genome assemblies of C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia are available in the Cucurbit Genomics Database (118) and in the NCBI RefSeq database (accessions XXXXXXXXX). The raw sequencing reads of the C. argyrosperma subsp. sororia genome are available in the NCBI Sequence Read Archive (accessions SRPXXXXXX- SRPXXXXXX). The raw sequencing reads of each individual sequenced by tGBS are available in the NCBI Sequence Read Archive (accessions SRPXXXXXX- SRPXXXXXX; table S2). The 13k SNP dataset is available in Dataset S1. Acknowledgments This work was funded by CONABIO KE004 ”Diversidad genetica de las especies de Cucurbita en Mexico e hibridacion entre plantas geneticamente modificadas y especies silvestres de Cucurbita” and CONABIO PE001 “Diversidad genetica de las especies de Cucurbita en Mexico. Fase II. Genomica evolutiva y de poblaciones, recursos geneticos y domesticacion” (both awarded to R.L-S., L.E.E. and S.M-H.). J.B-R. is a doctoral student from Programa de Doctorado en Ciencias Biomedicas, Universidad Nacional Autonoma de Mexico, and received fellowship 583146 from CONACyT. We acknowledge the Doctorado en Ciencias Biomedicas for the support provided during the development of this project. We thank Jodi Lynn Humann for her technical support while using the GenSAS web server. Special thanks to Laura Espinosa-Asuar for her technical support. We thank Rodrigo Garcia Herrera, head of the Scientific Computing Department at LANCIS, Instituto de Ecologia UNAM, for running the HTC infrastructure we used for the genome assembly and chromosome anchoring. The population genomic analyses were carried out using CONABIO’s computing cluster, which was partially funded by SEMARNAT through the grant “Contribucion de la Biodiversidad para el Cambio Climatico” to CONABIO. References 1. R. S. Meyer, M. D. Purugganan, Evolution of crop species: genetics of domestication and diversification. Nat Rev Genet 14, 840–52 (2013). 2. M. A. Zeder, Core questions in domestication research. Proc Natl Acad Sci U S A 112, 3191–8 (2015). 3. L. Kistler, et al., Gourds and squashes (Cucurbita spp.) adapted to megafaunal extinction and ecological anachronism through domestication. Proc Natl Acad Sci U S A 112, 15107–12 (2015). 4. R. Lira, et al., “Homo sapiens–Cucurbita interaction in Mesoamerica: Domestication Dissemination, and Diversification” in Ethnobotany of Mexico, (Springer New York, 2016), pp. 389–401. 5. G. Chomicki, H. Schaefer, S. S. Renner, Origin and domestication of Cucurbitaceae crops: insights from phylogenies, genomics and archaeology. New Phytol (2019). 18 6. G. Castellanos-Morales, et al., Historical biogeography and phylogeny of Cucurbita: Insights from ancestral area reconstruction and niche evolution. Mol Phylogenet Evol 128, 38–54 (2018). 7. H. S. Paris, “Genetic Resources of Pumpkins and Squash Cucurbita spp.” in Genetics and Genomics of Cucurbitaceae, (Springer International Publishing, 2016), pp. 111–154. 8. O. I. Sanjur, D. R. Piperno, T. C. Andres, L. Wessel-Beaver, Phylogenetic relationships among domesticated and wild species of Cucurbita (Cucurbitaceae) inferred from a mitochondrial gene: Implications for crop plant evolution and areas of origin. Proc Natl Acad Sci U S A 99, 535–40 (2002). 9. Z. Kerem, S. Lev-Yadun, A. Gopher, P. Weinberg, S. Abbo, Chickpea domestication in the NeolithicLevant through the nutritional perspective. Journal of Archaeological Science 34, 1289–1293 (2007). 10. D. Zizumbo-Villarreal, A. Flores-Silva, P. C.-G. Marín, The Archaic Diet in Mesoamerica: Incentive for Milpa Development and Species Domestication. Economic Botany 66, 328–343 (2012). 11. T. W. Whitaker, H. C. Cutler, Cucurbits and cultures in the Americas. Economic Botany 19, 344–349 (1965). 12. M. Nee, The domestication of Cucurbita (Cucurbitaceae). Economic Botany 44, 56–68 (1990). 13. R. L. Jarret, I. J. Levy, T. L. Potter, S. C. Cermak, L. C. Merrick, Seed oil content and fatty acid composition in a genebank collection of Cucurbita moschata Duchesne and C. argyrosperma C. Huber. Plant Genetic Resources 11, 149–157 (2013). 14. R. Lira, J. Caballero, Ethnobotany of the wild Mexican Cucurbitaceae. Economic Botany 56 (2002). 15. G. Sánchez-de la Vega, et al., Genetic Resources in the Calabaza Pipiana Squash (Cucurbita argyrosperma) in Mexico: Genetic Diversity, Genetic Differentiation and Distribution Models. Front Plant Sci 9, 400 (2018). 16. D. R. Piperno, A. J. Ranere, I. Holst, J. Iriarte, R. Dickau, Starch grain and phytolith evidence for early ninth millennium B.P. maize from the Central Balsas River Valley, Mexico. Proc Natl Acad Sci U S A 106, 5019– 24 (2009). 17. B. D. Smith, The Initial Domestication of Cucurbita pepo in the Americas 10,000 Years Ago. Science 276, 932–934 (1997). 18. D. Zizumbo-Villarreal, P. Colunga-GarcíaMarín, Origin of agriculture and plant domestication in West Mesoamerica. Genetic Resources and Crop Evolution 57, 813–825 (2010). 19. J. Barrera-Redondo, et al., The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita. Mol Plant 12, 506–520 (2019). 20. F. A. Simão, R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, E. M. Zdobnov, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–2 (2015). 21. J. Montero-Pau, et al., De novo assembly of the zucchini genome reveals a whole-genome duplication associated with the origin of the Cucurbita genus. Plant Biotechnol J 16, 1161–1171 (2018). 22. H. Sun, et al., Karyotype Stability and Unbiased Fractionation in the Paleo-Allotetraploid Cucurbita Genomes. Mol Plant 10, 1293–1306 (2017). 23. K. J. Hoff, S. Lange, A. Lomsadze, M. Borodovsky, M. Stanke, BRAKER1: Unsupervised RNA-Seq Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–9 (2016). 19 24. K. J. Hoff, A. Lomsadze, M. Borodovsky, M. Stanke, Whole-Genome Annotation with BRAKER. Methods Mol Biol 1962, 65–95 (2019). 25. M. Alonge, et al., RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol 20, 224 (2019). 26. T. W. Whitaker, W. P. Bemis, Origin and evolution of the cultivated Cucurbita. Bulletin of the Torrey Botanical Club 102 (1975). 27. A. Ott, et al., tGBS® genotyping-by-sequencing enables reliable genotyping of heterozygous loci. Nucleic Acids Res 45, e178 (2017). 28. T. H. Lee, H. Guo, X. Wang, C. Kim, A. H. Paterson, SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15, 162 (2014). 29. D. H. Alexander, J. Novembre, K. Lange, Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 1655–64 (2009). 30. L. Excoffier, M. Foll, fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–4 (2011). 31. L. Excoffier, I. Dupanloup, E. Huerta-Sánchez, V. C. Sousa, M. Foll, Robust demographic inference from genomic and SNP data. PLoS Genet 9, e1003905 (2013). 32. R. Cruz-Reyes, G. Avila-Sakar, G. Sánchez-Montoya, M. Quesada, Experimental assessment of gene flow between transgenic squash and a wild relative in the center of origin of cucurbits. Ecosphere 6, art248 (2015). 33. S. Montes-Hernandez, L. E. Eguiarte, Genetic structure and indirect estimates of gene flow in three taxa of Cucurbita (Cucurbitaceae) in western Mexico. Am J Bot 89, 1156–63 (2002). 34. P. de Villemereuil, O. E. Gaggiotti, A new FST-based method to uncover local adaptation using environmental variables. Methods in Ecology and Evolution 6, 1248–1258 (2015). 35. K. Luu, E. Bazin, M. G. Blum, pcadapt: an R package to perform genome scans for selection based on principal component analysis. Mol Ecol Resour 17, 67–77 (2017). 36. Y. Shang, et al., Biosynthesis, regulation, and domestication of bitterness in cucumber. Science 346, 1084– 8 (2014). 37. M. S. Campbell, C. Holt, B. Moore, M. Yandell, Genome Annotation and Curation Using MAKER and MAKER- P. Curr Protoc Bioinformatics 48, 4.11.1–39 (2014). 38. T. Derrien, et al., The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 22, 1775–89 (2012). 39. A. D. L. Nelson, et al., Evolinc: A Tool for the Identification and Evolutionary Comparison of Long Intergenic Non-coding RNAs. Front Genet 8, 52 (2017). 40. H. D. Wilson, Gene Flow in Squash Species. BioScience 40, 449–455 (1990). 41. R. G. Allaby, R. L. Ware, L. Kistler, A re-evaluation of the domestication bottleneck from archaeogenomic evidence. Evol Appl 12, 29–37 (2019). 42. R. Lira-Saade, Estudios taxonomicos ecogeograficos de las Cucurbitaceae Latinoamericanas de importancia economica (International Plant Genetic Resources Institute, Rome, Italy, 1995). 20 43. D. Jarvis, T. Hodgkin, “Farmer decision making and genetic diversity” in Genes in the Field, (CRC Press, 1999) https:/doi.org/10.1201/9781420049824.ch11. 44. D. I. Jarvis, et al., A global perspective of the richness and evenness of traditional crop-variety diversity maintained by farming communities. Proc Natl Acad Sci U S A 105, 5326–31 (2008). 45. S. Montes-Hernández, L. C. Merrick, L. E. Eguiarte, Maintenance of Squash (Cucurbita spp.) Landrace Diversity by Farmers' Activities in Mexico. Genetic Resources and Crop Evolution 52, 697–707 (2005). 46. J. Barrera-Redondo, et al., Variedades locales y criterios de selección de especies domesticadas del género Cucurbita (Cucurbitaceae) en los Andes Centrales del Perú: Tomayquichua, Huánuco. Botanical Sciences 98, 101–116 (2020). 47. G. Castellanos-Morales, et al., Tracing back the origin of pumpkins (Cucurbita pepo ssp. pepo L.) in Mexico. Proc Biol Sci 286, 20191440 (2019). 48. A. J. Ranere, D. R. Piperno, I. Holst, R. Dickau, J. Iriarte, The cultural and chronological context of early Holocene maize and squash domestication in the Central Balsas River Valley, Mexico. Proc Natl Acad Sci U S A 106, 5014–8 (2009). 49. W. Stinnesbeck, et al., The earliest settlers of Mesoamerica date back to the late Pleistocene. PLoS One 12, e0183345 (2017). 50. T. D. Dillehay, et al., New Archaeological Evidence for an Early Human Presence at Monte Verde, Chile. PLoS One 10, e0141923 (2015). 51. D. R. Piperno, I. Holst, L. Wessel-Beaver, T. C. Andres, Evidence for the control of phytolith formation in Cucurbita fruits by the hard rind (Hr) genetic locus: Archaeological and ecological implications. Proc Natl Acad Sci U S A 99, 10923–8 (2002). 52. D. R. Piperno, The Origins of Plant Cultivation and Domestication in the New World Tropics. Current Anthropology 52, S453–S470 (2011). 53. D. B. Lowry, et al., Responsible RAD: Striving for best practices in population genomic studies of adaptation. Mol Ecol Resour 17, 366–369 (2017). 54. D. B. Lowry, et al., Breaking RAD: an evaluation of the utility of restriction site-associated DNA sequencing for genome scans of adaptation. Mol Ecol Resour 17, 142–152 (2017). 55. G. Swinnen, A. Goossens, L. Pauwels, Lessons from Domestication: Targeting Cis-Regulatory Elements for Crop Improvement. Trends Plant Sci 21, 506–515 (2016). 56. D. H. Kim, et al., Small heat shock protein Hsp17.8 functions as an AKR2A cofactor in the targeting of chloroplast outer membrane proteins in Arabidopsis. Plant Physiol 157, 132–46 (2011). 57. K. H. Stanley, et al., Transcriptional divergence of the duplicated oxidative stress-responsive genes in the Arabidopsis genome. Plant J 41, 212–20 (2005). 58. P. Gaudet, M. S. Livstone, S. E. Lewis, P. D. Thomas, Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Brief Bioinform 12, 449–62 (2011). 59. J. Zhang, et al., Receptor-like cytoplasmic kinases integrate signaling from multiple plant immune receptors and are targeted by a Pseudomonas syringae effector. Cell Host Microbe 7, 290–301 (2010). 60. X. Moreira, L. Abdala-Roberts, R. Gols, M. Francisco, Plant domestication decreases both constitutive and induced chemical defenses by direct selection against defensive traits. Sci Rep 8, 12678 (2018). 21 61. K. Chen, et al., Abscisic acid dynamics, signaling, and functions in plants. J Integr Plant Biol 62, 25–54 (2020). 62. J. S. (P. Heslop-Harrison, T. Schwarzacher, “Genetics and genomics of crop domestication” in Plant Biotechnology and Agriculture, (Elsevier, 2012), pp. 3–18. 63. A. B. Martínez, et al., Differences in seed dormancy associated with the domestication of Cucurbita maxima: elucidation of some mechanisms behind this response. Seed Science Research 28, 1–7 (2017). 64. W. Jiang, D. Yu, Arabidopsis WRKY2 transcription factor mediates seed germination and post germination arrest of development by abscisic acid. BMC Plant Biol 9, 96 (2009). 65. T. J. Strabala, et al., Gain-of-function phenotypes of many CLAVATA3/ESR genes, including four new family members, correlate with tandem variations in the conserved CLAVATA3/ESR domain. Plant Physiol 140, 1331–44 (2006). 66. G. R. Rodríguez, et al., Distribution of SUN, OVATE, LC, and FAS in the tomato germplasm and the relationship to fruit shape diversity. Plant Physiol 156, 275–85 (2011). 67. C. Xu, et al., A cascade of arabinosyltransferases controls shoot meristem size in tomato. Nat Genet 47, 784–92 (2015). 68. J. L. Carpenter, S. E. Ploense, D. P. Snustad, C. D. Silflow, Preferential expression of an alpha-tubulin gene of Arabidopsis in pollen. Plant Cell 4, 557–71 (1992). 69. Y. Mei, H. B. Gao, M. Yuan, H. W. Xue, The Arabidopsis ARCP protein, CSI1, which is required for microtubule stability, is necessary for root and anther development. Plant Cell 24, 1066–80 (2012). 70. B. Burla, et al., Vacuolar transport of abscisic acid glucosyl ester is mediated by ATP-binding cassette and proton-antiport mechanisms in Arabidopsis. Plant Physiol 163, 1446–58 (2013). 71. C. Seiler, et al., ABA biosynthesis and degradation contributing to ABA homeostasis during barley seed development under control and terminal drought-stress conditions. J Exp Bot 62, 2615–32 (2011). 72. Y. Wang, et al., Construction of a High-Density Genetic Map and Analysis of Seed-Related Traits Using Specific Length Amplified Fragment Sequencing for Cucurbita maxima. Front Plant Sci 10, 1782 (2019). 73. D. Guo, et al., Resequencing 200 Flax Cultivated Accessions Identifies Candidate Genes Related to Seed Size and Weight and Reveals Signatures of Artificial Selection. Front Plant Sci 10, 1682 (2019). 74. J. Huang, R. Ghosh, V. A. Bankaitis, Sec14-like phosphatidylinositol transfer proteins and the biological landscape of phosphoinositide signaling in plants. Biochim Biophys Acta 1861, 1352–1364 (2016). 75. T. J. Raharjo, L. Nurliana, S. Mastjeh, Phospholipids from pumpkin (Cucurbita moschata (Duch.) Poir) seed kernel oil and their fatty acid composition. Indonesian Journal of Chemistry 11, 48–52 (2011). 76. S. Gonzalez-Jorge, et al., ZEAXANTHIN EPOXIDASE Activity Potentiates Carotenoid Degradation in Maturing Seed. Plant Physiol 171, 1837–51 (2016). 77. A. Frey, J. P. Boutin, B. Sotta, R. Mercier, A. Marion-Poll, Regulation of carotenoid and ABA accumulation during the development and germination of Nicotiana plumbaginifolia seeds. Planta 224, 622–32 (2006). 78. M. Murkovic, U. Mülleder, H. Neunteufl, Carotenoid Content in Different Varieties of Pumpkins. Journal of Food Composition and Analysis 15, 633–638 (2002). 79. G. Li, et al., Orphan genes are involved in drought adaptations and ecoclimatic-oriented selections in domesticated cowpea. J Exp Bot 70, 3101–3110 (2019). 22 80. M. Rendón-Anaya, et al., Genomic history of the origin and domestication of common bean unveils its closest sister species. Genome Biol 18, 60 (2017). 81. A. J. Alverson, et al., Insights into the evolution of mitochondrial genome size from complete sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol Biol Evol 27, 1436–48 (2010). 82. X. Aguirre-Dugua, et al., Evolutionary Dynamics of Transferred Sequences Between Organellar Genomes in Cucurbita. J Mol Evol 87, 327–342 (2019). 83. M. B. Hufford, et al., Comparative population genomics of maize domestication and improvement. Nat Genet 44, 808–11 (2012). 84. A. Zsögön, et al., De novo domestication of wild tomato using genome editing. Nat Biotechnol (2018). 85. A. R. Fernie, J. Yan, De Novo Domestication: An Alternative Route toward New Crops for the Future. Mol Plant 12, 615–631 (2019). 86. M. Xie, et al., A reference-grade wild soybean genome. Nat Commun 10, 1216 (2019). 87. N. Dierckxsens, P. Mardulyn, G. Smits, NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res 45, e18 (2017). 88. D. Kim, J. M. Paggi, C. Park, C. Bennett, S. L. Salzberg, Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019). 89. R. Kajitani, et al., Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res 24, 1384–95 (2014). 90. C. Ye, C. M. Hill, S. Wu, J. Ruan, Z. S. Ma, DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci Rep 6, 31900 (2016). 91. H. Li, Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018). 92. B. J. Walker, et al., Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014). 93. H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–95 (2010). 94. J. L. Humann, T. Lee, S. Ficklin, D. Main, Structural and Functional Annotation of Eukaryotic Genomes with GenSAS. Methods Mol Biol 1962, 29–51 (2019). 95. A. Dobin, et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 96. M. Stanke, O. Sch¨offmann, B. Morgenstern, S. Waack, Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006). 97. A. Lomsadze, P. D. Burns, M. Borodovsky, Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42, e119 (2014). 98. S. Koren, et al., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722–736 (2017). 99. A. Haug-Baltzell, S. A. Stephens, S. Davey, C. E. Scheidegger, E. Lyons, SynMap2 and SynMap3D: web-based whole-genome synteny browsers. Bioinformatics 33, 2197–2198 (2017). 100. S. Kurtz, et al., Versatile and open software for comparing large genomes. Genome Biol 5, R12 (2004). 23 101. D. Pratas, R. M. Silva, A. J. Pinho, P. J. Ferreira, An alignment-free method to find and visualize rearrangements between pairs of DNA sequences. Sci Rep 5, 10203 (2015). 102. S. Li, H. H. Chou, LUCY2: an interactive DNA sequence quality trimming and vector removal tool. Bioinformatics 20, 2865–6 (2004). 103. S. Hoffmann, et al., Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 5, e1000502 (2009). 104. S. Caboche, C. Audebert, Y. Lemoine, D. Hot, Comparison of mapping algorithms used in high throughput sequencing: application to Ion Torrent data. BMC Genomics 15, 264 (2014). 105. H. Li, et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–9 (2009). 106. H. Li, A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–93 (2011). 107. H. Li, J. Ruan, R. Durbin, Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18, 1851–8 (2008). 108. S. Purcell, et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559–75 (2007). 109. J. E. Wigginton, D. J. Cutler, G. R. Abecasis, A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet 76, 887–93 (2005). 110. P. Danecek, et al., The variant call format and VCFtools. Bioinformatics 27, 2156–8 (2011). 111. K. Keenan, P. McGinnity, T. F. Cross, W. W. Crozier, P. A. Prodöhl, diveRsity: An R package for the estimation and exploration of population genetics parameters and their associated errors. Methods in Ecology and Evolution 4, 782–788 (2013). 112. J. Catchen, P. A. Hohenlohe, S. Bassham, A. Amores, W. A. Cresko, Stacks: an analysis tool set for population genomics. Mol Ecol 22, 3124–40 (2013). 113. R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, C. D. Bustamante, Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5, e1000695 (2009). 114. P. Cingolani, et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012). 115. A. Alexa, J. Rahnenführer, T. Lengauer, Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–7 (2006). 116. R. C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–7 (2004). 117. A. Larsson, AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30, 3276–8 (2014). 118. Y. Zheng, et al., Cucurbit Genomics Database (CuGenDB): a central portal for comparative and functional genomics of cucurbit crops. Nucleic Acids Res 47, D1128–D1136 (2019). 82 DISCUSIÓN El estudio genómico de las calabazas nos permitió estudiar con detalle eventos evolutivos que ocurrieron en distintas escalas temporales, una a una escala macroevolutiva, como los efectos de una duplicación completa del genoma que ocurrió en el género Cucurbita hace ~30 millones de años (Barrera-Redondo et al., 2019) y otra a una escala microevolutiva, el proceso de domesticación de C. argyrosperma que comenzó hace tan solo ~13,800 años (Barrera-Redondo et al., 2020a). Los dos eventos estudiados en esta tesis están relacionados en términos de contingencia histórica (Blount et al., 2018). La duplicación completa del genoma derivó en el cambio y origen de elementos nuevos que pudieron ser o no seleccionados durante el proceso de domesticación. Asimismo, muchos de los resultados de la domesticación fueron deterministas en vez de contingentes (Blount et al., 2018), ya que las calabazas domesticadas muestran muchos de los síndromes de domesticación que se han observado en otras cucurbitáceas, tales como la pérdida de la cucurbitacina, el crecimiento del tamaño de la planta, el crecimiento del fruto, el crecimiento de la semilla y la pérdida de genes relacionados al estrés biótico y abiótico (Chomicki et al., 2019). Los resultados obtenidos en esta tesis nos abordan diversas cuestiones de relevancia: ya sea conocimiento básico de los procesos de evolución molecular, o el entendimiento del papel que juegan estos procesos evolutivos en la aparición de uno de los organismos domesticados que forman la base de nuestra alimentación. Tomando como ejemplo los procesos de duplicación completa del genoma, se puede discutir la posible existencia de una distinción relevante entre los procesos considerados como microevolutivos y macroevolutivos. La observación de nuevos genes codificantes y ARNs largos no codificantes en Cucurbita nos permite abordar otra cuestión crucial en evolución, que es el origen de las funciones a escala molecular, es decir: a qué nos referimos con una función en biología, de qué manera se pueden inferir funciones en los distintos elementos que componen a un genoma y cómo pueden surgir funciones nuevas en los genomas. Los análisis poblacionales realizados en C. argyrosperma nos revelan cómo la genética de poblaciones, a 83 pesar de sus limitantes, puede integrar la información genómica con el contexto ecológico e histórico de esta planta para elucidar un escenario concreto de su domesticación en Mesoamérica. Finalmente, el entendimiento de estos procesos de domesticación nos guía a plantear posibles usos de estos conocimientos para el beneficio de la sociedad, ya sea por medio de la conservación de los recursos fitogenéticos o mediante mejoramiento genético de los cultivos. La duplicación completa del genoma como evento macroevolutivo: ¿Una cuestión de escala de tiempo o de propiedades emergentes? Se ha argumentado que la distinción entre la macro y microevolución se suele abarcar en tres contextos distintos: 1) cuando aparecen planes corporales nuevos que puedan catalogarse como el origen de un jerarquía taxonómica supra- específica; 2) cuando se habla de procesos de especiación y extinción, donde la unidad de evolución son las especies; 3) y cuando se habla de la acumulación de procesos evolutivos en largas escalas de tiempo (Hautmann, 2020). El contexto del origen de nuevos planes corporales dio pie a la distinción original entre macro y microevolución (Philiptschenko, 1927), y ha sido un recurso teórico frecuentemente utilizado por los defensores del saltacionismo (Goldschmidt, 1933), quienes más recientemente recurren a la biología evolutiva del desarrollo (Evo-Devo) para tratar de encontrar cambios en los patrones de expresión genética durante el desarrollo que puedan generar cambios importantes en los planes corporales (Theißen, 2009). En la actualidad existen algunos ejemplos que se ajustan al modelo saltacionista, aunque parecen tratarse de eventos excepcionales, al contrario de los cambios graduales que parecen ser la norma en la evolución biológica (Chouard, 2010). A pesar de la plausibilidad de los eventos saltacionistas (Chouard, 2010), este tipo de eventos aparentemente no son cualitativamente distintos de los procesos microevolutivos, ya que la fijación de este tipo de caracteres “saltacionistas” u otros fenómenos observados por la Evo-Devo pueden analizarse 84 desde una perspectiva poblacional (Nunes et al., 2013). Bajo esta lógica, solo los procesos macroevolutivos donde ocurre selección a nivel de especie (Jablonski, 2008) podrían considerarse como fenómenos cualitativamente distintos que no pueden extrapolarse a la acumulación de procesos microevolutivos (Hautmann, 2020). Sin embargo, dicha definición tiene sus propias complicaciones, como lo es la delimitación, y por lo tanto definición, de una especie (Mallet, 1995). La duplicación completa de un genoma, como la que distingue a las calabazas de otros miembros de la familia Cucurbitaceae resulta ser un evento puntual de alopoliploidía (Sun et al., 2017; Montero-Pau et al., 2018). Éste podría interpretarse con un evento semi-saltacionista a nivel genómico, ya que los procesos de fraccionamiento y diploidización en plantas suelen involucrar la escisión repentina de grandes segmentos del genoma, más que una pérdida gradual gen por gen (Wang et al., 2011). Sin embargo, nuestros resultados con el genoma de C. argyrosperma indican la prevalencia de un gran número de pseudogenes, los cuales suelen encontrarse con mayor frecuencia en uno de los dos pares de cromosomas homeólogos — como encontramos en el análisis de sintenia entre los cromosomas 3 y 7 — pero separados por millones de pares de bases que contienen genes codificantes conservados (Barrera-Redondo et al., 2020a). Lo anterior apoya la idea de un proceso más gradual de pérdida gen por gen en los genomas de Cucurbita, ya que su diploidización fue mayoritariamente el resultado de eventos independientes de pseudogenización, en lugar de eventos de escisión de los genes duplicados (Barrera-Redondo et al., 2019; Barrera-Redondo et al., 2020a). Adicionalmente, el análisis de expansión y contracción de familias génicas dentro de Cucurbita muestra que sigue habiendo un ritmo alto de recambio de genes dentro del género. Esto indica que el proceso de fraccionamiento continúa ocurriendo de manera paralela y gradual más de 30 millones de años después del evento de duplicación de genoma en las distintas especies de Cucurbita, en conjunto con la tasa basal de recambio de genes de la familia Cucurbitaceae (Barrera-Redondo et al., 2019). 85 Por lo tanto, la evolución de familias génicas y de la estructura del genoma, más que un cambio de naturaleza saltacionista puede considerarse como una extensión a largo plazo de la microevolución. Los elementos de presencia/ausencia, las variaciones en número de copia y los cambios estructurales en los cromosomas se han observado y analizado en otros genomas a nivel poblacional desde hace más de una década (e.g., Wong et al., 2007; Tan et al., 2012; Hu et al., 2018), siendo estos catalogados como variación intraespecífica (Hautmann, 2020). Inclusive, los procesos de duplicación completa del genoma y fraccionamiento se han visto asociados a procesos de estructuración poblacional y eventual especiación entre poblaciones (Williams et al., 2016), convirtiéndolos nuevamente en una extensión de procesos microevolutivos, pero con repercusiones a gran escala en la evolución de los linajes (Williams et al., 2016; Hautmann, 2020). Cuando se habla de eventos de tipo saltacionista se suele hacer referencia a cambios en el fenotipo, más que a cambios dramáticos en la arquitectura del genoma (Goldschmidt, 1933; Theißen, 2009). El estudio de estos eventos se suele enfocar a cambios en las redes de regulación transcripcional (Theißen, 2009; Chouard, 2010). Sin embargo, los posibles efectos fenotípicos de las duplicaciones completas del genoma han sido menos estudiados, salvo por algunos ejemplos en Brassicales y Eudicotiledonias (Edger et al., 2015; Chanderbali et al., 2017; Clark y Donoghue, 2018). Recientemente, se realizó un estudio filogenómico en Cucurbitaceae donde se encontraron cuatro eventos independientes de duplicaciones genómicas en la familia (Guo et al., 2020). Estas incluyen una duplicación previamente descrita asociada al origen de la familia Cucurbitaceae (Wang et al., 2018) y a duplicación encontrada en el género Cucurbita (Sun et al., 2017). La duplicación completa del genoma asociada al origen de Cucurbitaceae fue determinante en la aparición de zarcillos, y posiblemente también del fruto pepónide, en dicha familia. Además, se encontró una asociación cercana entre dicha duplicación genómica y dos eventos de diversificación en Cucurbitaceae (Guo et al., 2020). Esto sugiere que las duplicaciones completas del genoma son una importante fuente de innovación evolutiva en las plantas, ya que la redundancia funcional resultante de estas duplicaciones las hace más evolucionables, 86 promoviendo la aparición de novedades fenotípicas y la aparición de nuevos linajes (Barrera-Redondo et al., 2020b). La repercusión de la duplicación completa del genoma sobre la aparición de diferencias morfológicas concretas que distingan a Cucurbita como un género no ha sido abordado de manera puntual. Los análisis de enriquecimiento funcional en genes duplicados de Cucurbita sugieren que éstos están asociados al desarrollo de estrategias adaptativas de supervivencia y al crecimiento y maduración del fruto (Sun et al., 2017). Sin embargo, los análisis filogenómicos de Guo et al. (2020) sugieren que esta duplicación genómica es compartida por todas las especies de la tribu Cucurbiteae, que incluye a Cucurbita y a otros 12 géneros. Por lo tanto, es necesario realizar más estudios a lo largo de los distintos géneros de la tribu Cucurbiteae para entender mejor qué caracteres morfológicos o fisiológicos coinciden cronológicamente con el evento de la duplicación (Barrera-Redondo et al., 2020b). Estudios enfocados en duplicaciones del genoma en diversos linajes de plantas han propuesto que los genes retenidos durante una duplicación completa del genoma suelen estar enriquecidos en factores transcripcionales y proteínas que forman complejos multi-proteicos que son sensibles a los cambios de dosis. Durante los eventos de poliploidización, ocurre un balanceo de dosis que permite la retención de este tipo de genes, a diferencia de las duplicaciones en tandem (Clark y Donoghue, 2018; Rendón-Anaya et al., 2019). Esto sugiere que las duplicaciones en tándem suelen favorecer la expansión de genes involucrados en adaptaciones locales recientes, mientras que las duplicaciones completas del genoma favorecen la expansión de genes involucrados en procesos basales de fisiología y desarrollo, implicando un posible efecto sobre los planes corporales (Clark y Donoghue, 2018; Rendón-Anaya et al., 2019). En el caso de las calabazas, los estudios de expresión diferencial muestran cambios sustanciales en la expresión genética entre genes homeólogos de los dos subgenomas del género Cucurbita (i.e., las dos copias diferenciadas de los cromosomas generadas durante el evento de alopoliploidización). Esto indica que 87 las copias de los genes duplicados han cambiado en cuanto a patrones relativos de expresión debido a rearreglos en las redes de regulación transcripcional (Sun et al., 2017). Esto abre la posibilidad de que la aparición de elementos reguladores novedosos, como los RNAs largos no codificantes, pudieron jugar un papel importante en los cambios de las redes de regulación transcripcional. Un estudio de expresión diferencial durante las etapas del desarrollo de los órganos podría revelar el papel de los genes duplicados en Cucurbiteae con rasgos morfológicos particulares de la tribu. Un rasgo particular en calabazas es su alta diversidad morfológica en frutos, considerada una de las más altas en el reino Plantae (Bisognin, 2002). Esta alta variación morfológica podría estar directa o indirectamente relacionada a la duplicación del genoma (Mattenberger et al., 2017). Consistente con esta hipótesis, cuatro de los genes duplicados en Cucurbita pertenecen a la familia OVATE, implicados en la regulación de la forma del fruto (Sun et al., 2017). Dicha variación morfológica puede estar dada por una alta evolucionabilidad de nuevos fenotipos en Cucurbiteae, capacidad dada por la duplicación de estos genes (Sun et al., 2017; Barrera-Redondo et al., 2020b). También es posible que las calabazas presenten una mayor plasticidad fenotípica que pueda haber surgido a partir de la duplicación completa del genoma, permitiendo una amplia variedad de formas en los frutos (Mattenberger et al., 2017). Esta hipótesis se puede poner a prueba, realizando un análisis con métodos filogenéticos comparativos para evaluar la correlación de la duplicación completa del genoma contra un incremento en la diversidad morfológica en los distintos géneros que componen a Cucurbiteae, comparando con otros taxa cercanos sin eventos de duplicación (e.g., Clark y Donoghue, 2018). El proceso de fraccionamiento y retención de genes duplicados en calabazas no parece ser azaroso, dado que se encuentran enriquecidos en ciertas funciones, a pesar de haber ocurrido un proceso de fraccionamiento equilibrado (Sun et al., 2017; Clark y Donoghue, 2018). La retención de genes duplicados puede ser entendida por el efecto de la selección positiva sobre las copias que lograron neofuncionalizarse, mediante la acumulación de mutaciones con cambio de función, 88 que a su vez fue posible por la relajación en las presiones de selección purificadora dada una redundancia funcional (Clark y Donoghue, 2018). Otra forma de ver este proceso es mediante la aparición de nuevos elementos que interactúan en las redes de regulación transcripcional, que mediante la pérdida de ciertos elementos desembocan en la formación de nuevas configuraciones estables llamadas atractores, las cuales dan pie a los diferentes linajes celulares de las plantas (Crombach y Hogeweg, 2008). La evolución de estas redes, y por lo tanto la evolución de los fenotipos que generan, se ve ligada a la tasa de recambio de los elementos que componen dicha red (Crombach y Hogeweg, 2008). En este sentido, la alta tasa de nacimiento y muerte de genes que observamos en el género Cucurbita pudo haber cambiado las conexiones en las redes de regulación transcripcional (Barrera-Redondo et al., 2019). Por otro lado, la tasa de recambio de lincRNAs se suele acelerar cuando ocurren “perturbaciones” a grande escala en los genomas de las plantas, tales como rearreglos cromosómicos, cambios de cariotipo y duplicaciones genómicas (Nelson y Shippen, 2015). Este aumento en la tasa de recambio de lincRNAs podría explicarse mediante un cambio en el paisaje epigenético de un organismo después de una perturbación genómica, lo cual conlleva a la pérdida y ganancia de transcritos reguladores, hasta llegar a una nueva configuración estable de regulación transcripcional (Crombach y Hogeweg, 2008). La aceleración en la tasa de recambio de lincRNAs en Cucurbita se ajustaría a esta hipótesis (Barrera- Redondo et al., 2019). La limitante de los análisis de redes radica en que no analiza explícitamente la aparición y cambio de los elementos en las redes, sino que solo modela los cambios en las interacciones de los elementos y su potencial fenotipo (Crombach y Hogeweg, 2008). Para ello, pareciera ineludible invocar a procesos de selección para entender los procesos de neofuncionalización de los genes y demás transcritos redundantes que se generaron durante la duplicación completa del genoma en las calabazas (Clark y Donoghue, 2018). De tal manera que estudiar las causas últimas del origen de la variación solo puede realizarse mediante análisis de evolución 89 molecular, más que estudios de la dinámica de las redes, los cuales se enfocan en la aparición de los fenotipos dada una red de elementos preexistentes o en los cambios de la red dada una tasa de recambio de elementos (Crombach y Hogeweg, 2008). Innovación evolutiva a escala molecular El estudio de genómica comparada que se realizó en el género Cucurbita permitió elucidar el origen evolutivo de nuevos genes y lincRNAs en el género, de los cuales se podría discutir su funcionalidad en el genoma (Barrera-Redondo et al., 2019). El origen de nuevos elementos funcionales en los genomas de los organismos es una pregunta fundamental en la biología evolutiva (Lynch y Walsh, 2007; Van Oss y Carvunis, 2019). Sin embargo, definir qué es un elemento funcional en un genoma es complicado en sí mismo (Kellis et al., 2014). Existen diversas líneas de evidencia que no necesariamente se sobrelapan para determinar que un elemento en el genoma es funcional, tales como la observación de consecuencias fenotípicas cuando son perturbados, su conservación a nivel evolutivo o la presencia de su actividad bioquímica en las células (Kellis et al., 2014). Cada una de estas evidencias tiene sus limitaciones, en particular de aquellos elementos con tan sólo actividad bioquímica, de los cuales se puede sospechar, pero no corroborar función (Graur et al., 2015). Además, los elementos en los genomas no actúan de manera aislada, sino que sólo operan de acuerdo con una red de interacciones, por lo que su relación con un fenotipo es en ocasiones insuficiente para definir que son funcionales (Crombach y Hogeweg, 2008; Pigliucci, 2010). Tal es el caso de algunos microRNAs, los cuales no están directamente involucrados en la emergencia de un fenotipo, sino que ayudan a que el desarrollo embrionario avance de manera adecuada cuando las condiciones abióticas no son ideales (Bartel, 2009). Finalmente, se tiene evidencia de ARNs largos no codificantes que carecen de conservación a nivel evolutivo, pero que han sido validados como funcionales (Kapusta y Feschotte, 2014). 90 Inclusive el concepto de función en biología es problemático en un contexto evolutivo, considerando que los elementos que forman a un organismo no aparecen con un fin preexistente, sino que son la culminación de fenómenos de causalidad (Amundson, y Lauder, 1994), ya sean procesos de autoorganización (Kauffman, 1993) o dinámicas históricas de mutación y selección (Long et al., 2013; Vakirlis et al., 2020). Por ello utilizaré del concepto “etiológico” de función, mejor conocido como el concepto de función como efecto seleccionado, donde un elemento se considera funcional de acuerdo con el efecto positivo que tiene en la adecuación del organismo que lo posee, independientemente de su origen causal (i.e., incluye adaptaciones y exaptaciones) (Wright, 1973; Amundson y Lauder, 1994; Graur et al., 2015). El concepto de función como efecto seleccionado nos permite enmarcar a las funciones biológicas en un contexto histórico, a diferencia de otras definiciones utilizadas en biología, como o es el concepto de función de rol causal. Este segundo concepto define a la función como la capacidad de un elemento (e.g., la capacidad de contracción de un músculo) de realizar una acción (e.g., movimiento) dentro de un sistema (e.g., una extremidad) (Cummins, 1975; Amundson y Lauder, 1994), dejando de lado el contexto histórico que dio origen a dicho elemento. Adicionalmente, el uso del concepto de función como efecto seleccionado nos permite clasificar a los elementos del genoma en cuatro categorías distintas. El “ADN literal” se puede definir como una secuencia funcional cuyo orden de nucleótidos se encuentra bajo selección (e.g., el marco de lectura de un gen codificante). El “ADN indiferente” son secuencias funcionales cuyo orden o identidad de nucleótidos no es importante, pero cuya presencia se encuentra bajo selección (e.g., la tercera posición de un codón de leucina). El ADN selectivamente neutral es aquel cuya presencia o ausencia no afecta la adecuación del organismo (e.g., algunas secuencias intergénicas), pero que tiene el potencial de adquirir una función nueva mediante la acumulación de mutaciones. Finalmente, el ADN deletéreo representa aquellas secuencias cuya existencia perjudica la adecuación del organismo que las posee (e.g., una inserción que altere un marco abierto de lectura) (Graur et al., 2015). 91 El problema de definir elementos funcionales en los genomas es particularmente importante en el estudio de los lincRNAs, ya que a pesar de existir múltiples ejemplos de actividad regulatoria en estos transcritos (Chekanova, 2015), su repercusión en el fenotipo es desconocida para la mayor parte de los lincRNAs descritos y tienen una baja tasa de conservación a nivel evolutivo (Ulitsky, 2016; Nelson et al., 2017). Algunos de los lincRNAs que han sido validados experimentalmente pueden ser clasificados como “ADN literal”, en el caso de aquellos transcritos que realizan su función reguladora a partir de su secuencia primaria o de su estructura secundaria, la cual se suele encontrar evolutivamente conservada (Chekanova, 2015; Graur et al., 2015). Pero también hay lincRNAs que pueden ser clasificados como “ADN indiferente”, que son aquellos transcritos cuya secuencia es irrelevante para su función, pero cuya actividad transcripcional actúa como regulador trascripcional en cis de uno o más genes en los alrededores por medio de interferencia transcripcional, ya sea porque comparten o se sobelapan sus secuencias promotoras (Shearwin et al., 2005; Chekanova, 2015). Es por ello por lo que los análisis de conservación evolutiva suelen ser insuficientes para determinar si un lincRNA es funcional o no (Kapusta y Feschotte, 2014). Nosotros encontramos ejemplos de lincRNAs con señales de selección purificadora y selección positiva en el genoma de C. argyrosperma (Barrera- Redondo et al., 2019; Barrera-Redondo et al., 2020a), por lo cual pueden considerarse funcionales como efecto seleccionado (Wright, 1973; Graur et al., 2015). Sin embargo, sigue siendo un misterio su repercusión en el fenotipo, dado que no presentan homología contra ningún transcrito validado experimentalmente en A. thaliana (Barrera-Redondo et al., 2019). A pesar de encontrar señales de selección en algunos de los lincRNAs de C. argyrosperma, el comportamiento de muchos de estos se asemeja al de ADN selectivamente neutral, ya que observamos una relación directa entre el nivel de conservación de los lincRNAs en cucurbitáceas y la distancia evolutiva entre los taxa (Barrera-Redondo et al., 2019), aunque esto no descarta la posible función de estos transcritos (Kapusta y Feschotte, 2014). 92 Otros elementos genómicos que presentan estos mismos problemas son los genes de origen evolutivo reciente, conocidos como genes huérfanos o genes de novo (Keeling et al., 2019; Van Oss y Carvunis, 2019). El nacimiento de genes de novo se refiere al origen de un gen nuevo y funcional a partir de ADN intergénico y selectivamente neutral (Van Oss y Carvunis, 2019). A pesar de representar una de las interrogantes más básicas en la biología evolutiva (Van Oss y Carvunis, 2019), el análisis funcional de los genes con aparición de novo ha sido complicado no solo por su origen reciente, sino también por la omisión de utilizar un concepto concreto de función al analizar estos elementos (Keeling et al., 2019). Nosotros encontramos señales de selección positiva en al menos dos genes con un origen potencialmente de novo exclusivos de calabazas y de C. argyrosperma (Barrera-Redondo et al., 2020a), lo cual nuevamente se ajusta al concepto de función como efecto seleccionado (Wright, 1973; Graur et al., 2015). Sin embargo, la tasa de nacimiento de novo puede llegar a sobreestimarse con métodos tradicionales para buscar homología, por lo que los análisis de sintenia con otros genomas pueden ayudar a verificar la aparición de novo de un gen (Casola, 2018). Se requieren de validaciones experimentales que vayan más allá de los organismos modelo para entender la función de aquellos elementos con conservación evolutiva baja como los genes de novo y los lincRNAs, lo que nos ayudará a comprender la importancia de estas novedades evolutivas y sus repercusiones en el fenotipo (Casola, 2018). Esto se puede realizar por medio de ensayos “knockout” de estos elementos, utilizando tecnologías de edición genética que se encuentran disponibles en diversos taxa de plantas (Zhang et al., 2017). Otra forma eficiente de inferir la función de estos elementos sería mediante experimentos de expresión diferencial, para asociar su actividad transcripcional a cierto órgano, etapa del desarrollo o respuesta a estrés (Nelson et al., 2017). Los eventos de duplicación completa del genoma pueden promover la aparición de genes de novo a partir de ADN no codificante, como se ha observado en genomas de hongos microesporidios (Williams et al., 2016). Esto parece haber 93 ocurrido en las calabazas, ya que se encontraron algunos genes codificantes en el genoma de C. argyrosperma sin homología a otras proteínas reportadas en GenBank (Barrera-Redondo et al., 2020a). En cuanto a la tasa de nacimiento de lincRNAs, la duplicación completa del genoma resultó en la aparición de una gran cantidad de transcritos exclusivos de calabazas, varios de los cuales surgieron por medio de la duplicación y pseudogenización de genes codificantes (Barrera-Redondo et al., 2019). El haber encontrado señales de selección en estos elementos sugiere que los pseudogenes pueden neofuncionalizarse como elementos regulatorios (Barrera-Redondo et al., 2019; Barrera-Redondo et al., 2020a). También se ha reportado que algunos genes codificantes surgen de novo a partir de ARNs no codificantes que adquieren marcos abiertos de lectura espontáneamente por mutaciones (Ruiz-Orera et al., 2014; Casola, 2018; Vakirlis et al., 2020). Esto indica que la neofuncionalización de elementos transcripcionalmente activos puede ser una fuente esencial de innovación evolutiva, generando así nuevos elementos funcionales, ya sea de genes codificantes a lincRNAs (Kapusta et al., 2013; Barrera-Redondo et al., 2019) o viceversa (Ruiz- Orera et al., 2014; Vakirlis et al., 2020). Los resultados comparativos con los genomas de calabazas y cucurbitáceas muestran que, en comparación con los genes codificantes, los lincRNAs tienen una alta tasa de recambio, incluyendo una alta tasa de duplicación y extinción (Barrera- Redondo et al., 2019). Esta alta tasa de recambio no es exclusiva de calabazas, ya que también se ha observado en otras plantas (Nelson et al., 2016) y en menor medida en tetrápodos (Necsulea et al., 2014), por lo que los lincRNAs pueden ser considerados una fuente importante de innovación evolutiva en los organismos (Ruiz-Orera et al., 2014). Podemos concluir que la importancia evolutiva de los lincRNAs no está atada únicamente a su función como efecto seleccionado (Graur et al., 2015), sino también a su potencial evolutivo de generar nuevos elementos funcionales en el genoma (Kapusta y Feschotte, 2014). 94 Además de poseer una alta tasa de recambio, los lincRNAs también tienen una alta tasa de sustitución (Ulitsky, 2016) debido a que la función de muchos de estos transcritos depende únicamente de su estructura secundaria o de su posición relativa a la de otros genes (Kapusta y Feschotte, 2014). De esta manera, los lincRNAs pueden perder rápidamente las señales de homología con otros transcritos con los que compartan ancestría común. La alta tasa de sustituciones de los lincRNAs puede llevarnos a inferir erróneamente que estos genes tienen un origen de novo (Kapusta y Feschotte, 2014) cuando en realidad pudieron surgir por eventos antiguos duplicación (Casola, 2018). Si además consideramos que los genes codificantes sin homología a otras proteínas pueden tener un origen en regiones de transcritos no codificantes (Casola, 2018; Vakirlis et al., 2020), entonces debemos replantear la posibilidad de que mucha de la evolución aparentemente “de novo” que observamos en los genomas, sean en realidad procesos de neofuncionalización a partir de elementos preexistentes con homología profunda, difícilmente detectable por métodos de similitud (Ruiz-Orera et al., 2014; Casola, 2018). Lo anterior apoya la idea de que las fuerzas evolutivas operan primordialmente modificando los recursos biológicos disponibles en sistemas imperfectos, pero con el potencial de llegar a ser altamente eficientes (Gould, 1994). Para comprender mejor estos procesos, es necesario realizar futuros estudios basados en sintenia y validación experimental que analicen a profundidad la evolución de los ARNs no codificantes con homología detectable a genes codificantes, en conjunto con los genes codificantes potencialmente de novo con aparición evolutiva reciente. El estudio de la domesticación a la luz de la genómica de poblaciones La genómica de poblaciones es útil para comprender el comportamiento de las innovaciones evolutivas bajo un contexto de interacción con el entorno de los organismos que las poseen (Jorde, 2001; Ross-Ibarra et al., 2007). Los procesos de domesticación a la luz de la genómica de poblaciones son muy útiles para 95 entender la dinámica evolutiva tanto de los elementos funcionales como de los elementos neutrales en los genomas (Ross-Ibarra et al., 2007), ya que estos procesos demográficos y selectivos han ocurrido recientemente en relación con otros procesos evolutivos (Purugganan y Fuller, 2009). Adicionalmente, el estudio del registro arqueológico y las prácticas de agricultura tradicional nos permiten entender el contexto ecológico en que se desencadenaron los procesos de domesticación (Casas et al., 1996; Fuller et al., 2014; Milla et al., 2015). Por lo tanto, el estudio de la domesticación de C. argyrosperma nos permitió estudiar a detalle la interacción entre su genoma y el ambiente, revelando tanto la historia demográfica como los patrones de selección en esta especie (Kantar et al., 2017; Barrera- Redondo et al., 2020a). Dado que las huellas históricas de la domesticación son recientes, diversas diciplinas como la arqueología, la ecología, la etnobotánica y la lingüística pueden aportar información valiosa al entendimiento de este proceso (Larson et al., 2014; Milla et al., 2015). Nuestros resultados de la demografía histórica de C. argyrosperma fueron congruentes y complementarios a la evidencia arqueológica que se tiene para esta especie, permitiéndonos elucidar un escenario claro de su domesticación (Barrera-Redondo et al., 2020a). Los análisis demográficos revelan que la domesticación de C. argyrosperma está íntimamente ligada a los patrones de migración y a los hábitos alimenticios de los humanos en Mesoamérica (Barrera-Redondo et al., 2020a). Dichos resultados fueron obtenidos mediante el análisis de la variación no funcional del genoma (Barrera-Redondo et al., 2020a), por lo que el estudio del ADN neutral es valioso para tener una comprensión global de los procesos evolutivos. A su vez, el entendimiento de las prácticas agrícolas actuales (Montes-Hernández et al., 2005) y las necesidades nutricionales de las primeras poblaciones humanas en Mesoamérica (Zizumbo-Villarreal et al., 2012) nos permitió contextualizar las señales de selección positiva que observamos a lo largo del genoma de C. argyrosperma (Barrera-Redondo et al., 2020a). 96 Nuestros resultados sugieren que las prácticas agrícolas tradicionales con las que se cosecha a C. argyrosperma promueven la diversidad genética y reducen el efecto de cuello de botella que se esperaría en un evento de domesticación (Montes-Hernández et al., 2005; Barrera-Redondo et al., 2020a). La congruencia entre los resultados obtenidos con marcadores genómicos y los resultados de otras diciplinas como la antropología (Zizumbo-Villarreal y Colunga-GarcíaMarín, 2010; Zizumbo-Villarreal et al., 2012) y la etnobotánica (Montes-Hernández et al., 2005) es evidencia del poder explicativo de la genética de poblaciones en el estudio de la evolución (Baedke et al., 2020). Algunas de las limitantes de la genómica de poblaciones en el entendimiento de la domesticación están basadas en su enfoque exclusivo hacia la variación genética (Baedke et al., 2020). Sin embargo, también existen limitantes tecnológicas y económicas que impiden un análisis más profundo de la domesticación por la genómica de poblaciones, como es el uso alternativas a la secuenciación completa de genomas que capturan un menor número de variantes genéticas a favor de una reducción en los costos de secuenciación (Schreiber et al., 2018). A pesar de que la reducción en los costos de secuenciación masiva ha revolucionado el estudio de la evolución (Eguiarte et al., 2013), muchas veces se tienen que recurrir a métodos de representación reducida del genoma para realizar estudios genómicos costeables a una escala poblacional. Como su nombre lo indica, estas técnicas de secuenciación reducen a porción total del genoma que es secuenciado para abaratar costos, ya sea mediante la secuenciación dirigida hacia la porción codificante del genoma (e.g., exomas y RNAseq) o mediante a secuenciación de regiones adyacentes a sitios de restricción (e.g., GBS y RADseq) (Schreiber et al., 2018). El uso de tGBS en C. argyrosperma eficientizó el costo de secuenciación por individuo, permitiendo el llamado de decenas de miles de variantes a lo largo del genoma de cientos de individuos con una alta confianza (Ott et al., 2017), con los cuales estimamos de manera precisa la historia demográfica de C. argyrosperma durante su domesticación (Barrera-Redondo et al., 2020a). Por otro lado, la búsqueda de barridos selectivos en C. argyrosperma fue fructífera pero 97 limitada, dado que las librerías de tGBS no son aconsejables para realizar este tipo de pruebas (Lowry et al., 2017). Adicionalmente, se encontró un bajo desequilibrio de ligamiento entre los SNPs de C. argyrosperma, volviendo problemática la búsqueda de barridos electivos (Barrera-Redondo et al., 2020a). Esta problemática se debe a que la densidad de SNPs que se pueden recobrar utilizando tGBS no es suficiente para abarcar la totalidad de haplotipos que componen al genoma. Por lo tanto, la búsqueda de barridos selectivos se limita a un subconjunto de todas las posibles señales de selección que existen en el genoma, dado que pueden existir haplotipos bajo selección que pudieran no haber sido secuenciados con tGBS. Además, esto limita la asociación entre una variable detectada con señales de selección y su posible papel como variante causal del barrido detectado, dado que múltiples loci en un mismo haplotipo pueden presentar la misma señal de selección por efectos de hitchhiking (Lowry et al., 2017). La secuenciación completa de genomas siempre será una alternativa superior al uso de representaciones reducidas del genoma para la búsqueda de señales de selección en el genoma, ya que se pueden secuenciar muchas más variantes a lo largo de cada cromosoma (Schreiber et al., 2018). Esto es particularmente útil cuando el desequilibrio de ligamiento se pierde rápidamente en los haplotipos y se pretenden encontrar las variantes causales de los barridos selectivos (Lowry et al., 2017). Sin embargo, los análisis de secuenciación completa de genomas a nivel poblacional también tienen sus limitaciones, como utilizar un solo genoma de referencia para realizar la búsqueda de variantes, además de los costos restrictivos que limitan el número de individuos y poblaciones secuenciadas (Golicz et al., 2016; Schreiber et al., 2018). Considero que el siguiente paso en los estudios de domesticación en calabazas con enfoques de genómica poblacional deberá basarse en el análisis de pan-genomas, donde cada individuo sea secuenciado y ensamblado de novo (Golicz et al., 2016). Los estudios pan-genómicos permitirán el análisis de las variantes estructurales y de presencia/ausencia que estén involucradas en los procesos de domesticación (Golicz et al., 2016; Zhao et al., 2018). El elevado costo 98 de esta aproximación lo vuelve una promesa en un futuro no muy cercano, cuando la secuenciación de tercera generación sea más barata y, con suerte, con una menor tasa de error (Golicz et al., 2016). Eventualmente será necesario llevar a cabo estudios enfocados a la domesticación de las calabazas desde otras líneas de investigación que no solo nos permitan estudiar la evolución del genoma y su correlación con los fenotipos, sino también entender los mecanismos de causalidad que dan origen a los fenotipos (Baedke et al., 2020). Un primer paso para ello serán los estudios de transcriptómica y metabolómica comparada entre calabazas domesticadas y sus parientes silvestres, con los cuales se puedan encontrar los patrones de expresión diferencial que liguen a los genotipos con los fenotipos (e.g., Hradilová et al., 2017). Los estudios de domesticación como herramienta biotecnológica El estudio de la domesticación no solo es relevante para comprender las bases de la evolución o la conexión entre los cultivos y la humanidad (Ross-Ibarra et al., 2007; Zizumbo-Villarreal y Colunga-GarcíaMarín, 2010). También es un área que genera conocimiento agronómicamente relevante y de una aplicación directa y relativamente sencilla (Bellón et al., 2009; Hufford et al., 2012; Fernie y Yan, 2019). En concreto, el estudio de la domesticación y diversidad genética de los cultivos es de particular relevancia para la agricultura, ya que nos ayuda a reconocer y entender los recursos fitogenéticos. Con esto nos referimos a describir la variación genética de los cultivos con valor agronómico para la seguridad alimenticia, tanto de las plantas domesticadas como de sus parientes silvestres (Hajjar y Hodgkin, 2007; Bellón et al., 2009). Un aspecto fundamental que influye en la conservación de los recursos fitogenéticos son las prácticas agrícolas bajo las que se siembran los cultivos (Jarvis et al., 2008). Se ha documentado en diversos cultivos del planeta que las prácticas agrícolas intensivas llevan a la extinción de variedades locales (Jarvis et al., 2008), lo cual erosiona la diversidad genética de las plantas domesticadas (Jarvis et al., 99 2008). Nuestras observaciones sugieren que las prácticas agrícolas tradicionales realizadas en C. argyrosperma (Montes-Hernández et al., 2005) promueven la diversidad genética en sus poblaciones domesticadas (Barrera-Redondo et al., 2020a), por lo que es importante salvaguardar estas prácticas tradicionales como reservas in situ de los recursos fitogenéticos (Jarvis et al., 2008). Por lo general, las plantas domesticadas pasan por cuellos de botella y barridos selectivos que reducen dramáticamente la diversidad genética de los cultivos, reduciendo su capacidad adaptativa y fijando mutaciones deletéreas (Moyers et al., 2018). Este proceso se ve acentuado durante los procesos de mejoramiento, ya que se forman cuellos de botella adicionales que agotan la diversidad genética de los cultivos (Moyers et al., 2018). Es importante señalar que las poblaciones domesticadas de C. argyrosperma no mostraron señales drásticas de pérdida de diversidad genética como resultado del cuello de botella (Sánchez-de la Vega et al., 2018; Barrera-Redondo et al., 2020a). Consideramos que esto se debe en parte a que esta especie no ha pasado aún por procesos de mejoramiento genético moderno (Lira et al., 2016), a diferencia de otros cultivos importantes como el maíz o el frijol (Moyers et al., 2018). Sin embargo, la atenuación del cuello de botella también puede estar dada por flujo genético constante con sus parientes silvestres (Barrera-Redondo et al., 2020a). Los parientes silvestres representan una diversidad genética más amplia a la que presentan las poblaciones domesticadas. Estas poblaciones silvestres preservan elementos funcionales que se perdieron durante los cuellos de botella que sufrieron los taxa domesticados (Hajjar y Hodgkin, 2007). Por lo tanto, no solo es relevante conservar la diversidad genética de las calabazas domesticadas, sino también realizar esfuerzos de conservación en los taxa silvestres, quienes fungen como un reservorio invaluable de diversidad genética aprovechable en los cultivos domesticados (Hajjar y Hodgkin, 2007). La búsqueda de genes candidatos asociados a la domesticación también resulta útil en el proceso del mejoramiento genético (Hufford et al., 2012). Se ha observado que los genes que fueron seleccionados durante el proceso de 100 domesticación también son blanco de los esfuerzos por generar variedades mejoradas en los cultivos, por lo que identificarlos resulta de interés agronómico (Hufford et al., 2012). Las señales de selección que encontramos en el genoma de C. argyrosperma revelaron genes candidatos que podrían estar asociados a caracteres de importancia agronómica como el tamaño del fruto, el tamaño de la semilla y el contenido de fosfolípidos en la semilla (Barrera-Redondo et al., 2020a). Estos resultados podrían guiar el eventual mejoramiento de los cultivos de C. argyrosperma por medio de selección asistida por marcadores o por la introgresión de alelos deseados con técnicas de edición genómica (Jiang, 2013; Zhou et al., 2019). Adicionalmente, se pueden realizar prácticas de domesticación de novo, en las cuales a una planta silvestre o semi-domesticada se le introducen los alelos responsables de los fenotipos de domesticación con métodos de edición genómica (Fernie y Yan, 2019). Esta es una manera creativa en que se puede conseguir un cultivo con caracteres deseables sin el problema de acarrear los costos de la domesticación como la pérdida de resistencia a plagas u otros caracteres deletéreos e indeseables (Fernie y Yan, 2019). Para que un proyecto de domesticación de novo sea exitoso, es esencial conocer las variantes causales de los caracteres fenotípicos deseables en una planta cultivada (Fernie y Yan, 2019). Por ello, es importante que los estudios de genes candidatos no solo estén basados en la asociación entre una variante y un fenotipo, sino que se realicen validaciones experimentales que demuestren una relación de causalidad (Zhang et al., 2017). Sin embargo, no hay que dejar que las expectativas tecnológicas de la ingeniería genética y de las domesticaciones de novo le resten importancia a la conservación in situ de los recursos fitogenéticos, ya que estos sientan las bases biológicas y culturales de los procesos de domesticación y mejoramiento (Bellón et al., 2009). 101 CONCLUSIONES 1. Se secuenciaron, ensamblaron y anotaron dos genomas de referencia a nivel de cromosoma de C. argyrosperma, uno para la subespecie domesticada y otro para la subespecie silvestre. Ambos sientan las bases para estudios evolutivos y agronómicos tanto en C. argyrosperma como en el resto del género Cucurbita. 2. Proponemos que la duplicación completa del genoma en el género Cucurbita produjo una aceleración en la tasa de evolución de familias de genes codificantes y ARNs largos no codificantes, ya que la redundancia funcional dentro del genoma facilitó la exaptación de elementos genéticos previamente existentes hacia nuevas funciones. 3. Durante el proceso de fraccionamiento y diploidización en calabazas, una proporción de los genes codificantes duplicados pasaron por procesos de pseudogenización y se propone que posteriormente se neofuncionalizaron como ARNs largos intergénicos no codificantes. 4. Los análisis de genómica poblacional revelaron que la domesticación de C. argyrosperma comenzó poco después de la llegada de los humanos a Mesoamérica, hace aproximadamente 13,800 años en Jalisco. Esto ocurrió bajo un flujo genético constante con su pariente silvestre, lo cual atenuó los efectos de cuello de botella durante la domesticación de la especie. 5. Detectamos señales de selección atribuibles al proceso de domesticación de C. argyrosperma, los cuales están asociados a genes que podrían explicar algunos aspectos del fenotipo de la subespecie domesticada como la pérdida de mecanismos de defensa, el crecimiento del fruto, el crecimiento de la semilla, la pérdida de la dormancia en la semilla y el contenido de lípidos en la semilla. 102 REFERENCIAS Amundson, R., Lauder, G. V. (1994). Function without purpose. Biology and philosophy 9: 443-469. Baedke, J., Fábregas-Tejeda, A., Vergara-Silva, F. (2020). Does the extended evolutionary synthesis entail extended explanatory power? Biology & Philosophy 35: 20. Barrera-Redondo, J., Ibarra-Laclette, E., Vázquez-Lobo, A., Gutiérrez-Guerrero, Y. T., de la Vega, G. S., Piñero, D., Montes-Hernández, S., Lira-Saade, R., Eguiarte, L. E. (2019). The genome of Cucurbita argyrosperma (silver-seed gourd) reveals faster rates of protein-coding gene and long noncoding RNA turnover and neofunctionalization within Cucurbita. Molecular Plant 12: 506-520. Barrera-Redondo, J., Sánchez-de la Vega, G., Aguirre-Liguori, J. A., Castellanos-Morales, G., Aguirre-Dugua, X., Aguirre-Planter, E., Gutiérrez- Guerrero, Y. T., Tenaillon, M., Montes-Hernández, S., Lira-Saade, R., Eguiarte, L. E. (2020a). The domestication patterns of Cucurbita argyrosperma are consistent with early migration events in Mesoamerica and general domestication syndromes in squashes. Under review. Barrera-Redondo, J., Lira-Saade, R., & Eguiarte, L. E. (2020b). Gourds and tendrils of Cucurbitaceae: how their shape diversity, molecular and morphological novelties evolved via whole-genome duplications. Molecular Plant 13: 1108-1110. Bartel, D. P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136: 215-233. Bellón, M. R., Barrientos-Priego, A. F., Colunga-GarcíaMarín, P., Perales, H., Reyes Agüero, J. A., Rosales-Serna, R., Zizumbo-Villarreal, D. (2009). Diversidad y conservación de recursos genéticos en plantas cultivadas. Capital natural de México 2: 355-382. Bennett, M. D., Smith, J. B. (1976). Nuclear DNA Amounts in Angiosperms. Philos. Trans. R. Soc. B Biol. Sci. 274: 227–274. Bisognin, D. A. (2002). Origin and evolution of cultivated cucurbits. Ciência Rural 32: 715–723. Blount, Z. D., Lenski, R. E., Losos, J. B. (2018). Contingency and determinism in evolution: Replaying life’s tape. Science 362: eaam5979. Brigandt, I., Love, A. C. (2012). Conceptualizing evolutionary novelty: moving beyond definitional debates. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 318: 417-427. 103 Casas, A., del Carmen Vázquez, M., Viveros, J. L., Caballero, J. (1996). Plant management among the Nahua and the Mixtec in the Balsas River Basin, Mexico: an ethnobotanical approach to the study of plant domestication. Human Ecology 24: 455-478. Casola, C. (2018). From de novo to “de nono”: the majority of novel protein- coding genes identified with phylostratigraphy are old genes or recent duplicates. Genome biology and evolution 10: 2906-2918. Castellanos-Morales, G., Paredes-Torres, L. M., Gámez, N., Hernández- Rosales, H. S., Sánchez-de la Vega, G., Barrera-Redondo, J., ... Eguiarte, L. E. (2018). Historical biogeography and phylogeny of Cucurbita: insights from ancestral area reconstruction and niche evolution. Molecular Phylogenetics and Evolution 128: 38-54. Chanderbali, A. S., Berger, B. A., Howarth, D. G., Soltis, D. E., Soltis, P. S. (2017). Evolution of floral diversity: genomics, genes and gamma. Philosophical Transactions of the Royal Society B: Biological Sciences 372: 20150509. Chekanova, J. A. (2015). Long non-coding RNAs and their functions in plants. Current opinion in plant biology 27: 207-216. Cheng, F., Wu, J., Cai, X., Liang, J., Freeling, M., Wang, X. (2018). Gene retention, fractionation and subgenome differences in polyploid plants. Nature Plants 4: 258-268. Chomicki, G., Schaefer, H., Renner, S. S. (2019). Origin and domestication of Cucurbitaceae crops: insights from phylogenies, genomics and archaeology. New Phytologist 226: 1240-1255. Chouard, T. (2010). Evolution: revenge of the hopeful monster. Nature 463: 864– 867. Clark J. W., Donoghue C. J. (2018). Whole-Genome Duplication and Plant Macroevolution. Trends in Plant Science 23: 933-945. Crombach, A., Hogeweg, P. (2008). Evolution of evolvability in gene regulatory networks. PLoS computational biology, 4: e1000112. Cummins, R. (1975). Functional analysis. The Journal of Philosophy 72: 741– 765. Diamond J. (2002). Evolution, consequences and future of plant and animal domestication. Nature 418: 700–707. Edger, P. P., Heidel-Fischer, H. M., Bekaert, M., Rota, J., Glöckner, G., Platts, A. E., ..., Hofberger, J. A. (2015). The butterfly plant arms-race escalated by gene and genome duplications. Proceedings of the National Academy of Sciences 112: 8362-8366. 104 Eguiarte, L. E., Aguirre-Liguori, J. A., Jardón-Barbolla, L., Aguirre-Planter, E., Souza, V. (2013). Genómica de poblaciones: nada en evolución va a tener sentido si no es a la luz de la genómica, y nada en genómica tendrá sentido si no es a la luz de la evolución. TIP Revista Especializada En Ciencias Químico- Biológicas, 16(1): 42-56. Esteras, C., Gómez, P., Monforte, A. J., Blanca, J., Vicente-Dólera, N., Roig, C., ... Picó, B. (2012). High-throughput SNP genotyping in Cucurbita pepo for map construction and quantitative trait loci mapping. BMC Genomics 13: 80. Fernie, A. R., Yan, J. (2019). De novo domestication: an alternative route toward new crops for the future. Molecular Plant 12: 615-631. Fuller, D. Q., Denham, T., Arroyo-Kalin, M., Lucas, L., Stevens, C. J., Qin, L., ... Purugganan, M. D. (2014). Convergent evolution and parallelism in plant domestication revealed by an expanding archaeological record. Proceedings of the National Academy of Sciences 111: 6147-6152. Futuyma, D. J. (2005). Evolution. Sinauer Associates, Massachusetts. 11p. Garcia-Mas, J., Benjak, A., Sanseverino, W., Bourgeois, M., Mir, G., González, V. M., ... Alioto, T. (2012). The genome of melon (Cucumis melo L.). Proceedings of the National Academy of Sciences 109: 11872-11877. Gepts, P. (2014). The contribution of genetic and genomic approaches to plant domestication studies. Curr. Opin. Plant Biol. 18: 51–59. Goldschmidt, R. (1933). Some aspects of evolution. Science 78: 539–547. Golicz, A. A., Batley, J., Edwards, D. (2016). Towards plant pangenomics. Plant Biotechnology Journal 14: 1099-1105. Gong, L., Stift, G., Kofler, R., Pachner, M., Lelley, T. (2008). Microsatellites for the genus Cucurbita and an SSR-based genetic linkage map of Cucurbita pepo L. Theor. Appl. Genet. 117: 37–48. Gould, S. J. (1994). El pulgar del panda: reflexiones sobre historia natural y evolución (No. 575.8 GOU). Graur, D., Zheng, Y., Azevedo, R. B. (2015). An evolutionary classification of genomic function. Genome biology and evolution 7: 642-645. Guo, S., Zhang, J., Sun, H., Salse, J., Lucas, W. J., Zhang, H., ... Min, J. (2013). The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nature genetics 45: 51-58. Guo, J., Xu, W., Hu, Y., Huang, J., Zhao, Y., Zhang, L., Huang, C.-H., Ma, H. (2020). Phylotranscriptomics in Cucurbitaceae Reveal Multiple Whole Genome 105 Duplications and Key Morphological and Molecular Innovations. Molecular Plant 13: 1117–1133. Gustafson, P., Gong, L., Pachner, M., Kalai, K., Lelley, T. (2008). SSR-based genetic linkage map of Cucurbita moschata and its synteny with Cucurbita pepo. Genome 51: 878–887. Hajjar, R., Hodgkin, T. (2007). The use of wild relatives in crop improvement: a survey of developments over the last 20 years. Euphytica 156: 1-13. Hancock, J. F. (2005). Contributions of Domesticated Plant Studies to our Understanding of Plant Evolution. Ann. Bot. 96: 953–963. Hautmann, M. (2020). What is macroevolution? Frontiers in Palaeontology 63: 1- 11. Hedrick, P. W. (2011). Genetics of populations. Jones & Bartlett Learning. Hu, Z., Wang, W., Wu, Z., Sun, C., Li, M., Lu, J., Fu, B., Shi, J., Xu, J., Ruan, J., Wei, C. (2018). Novel sequences, structural variations and gene presence variations of Asian cultivated rice. Scientific Data 5: 180079. Hradilova, I., Trněný, O., Valkova, M., Cechova, M., Janska, A., Prokešová, L., ... Varshney, R. K. (2017). A combined comparative transcriptomic, metabolomic, and anatomical analyses of two key domestication traits: pod dehiscence and seed dormancy in pea (Pisum sp.). Frontiers in Plant Science 8: 542. Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., ... Ren, Y. (2009). The genome of the cucumber, Cucumis sativus L. Nature Genetics 41: 1275-1281. Hufford, M. B., Xu, X., Van Heerwaarden, J., Pyhäjärvi, T., Chia, J. M., Cartwright, R. A., ... Lai, J. (2012). Comparative population genomics of maize domestication and improvement. Nature Genetics 44: 808-811. Jablonski, D. (2008). Species selection: theory and data. Annual review of ecology, evolution, and systematics 39: 501-524. Jarvis, D. I., Brown, A. H., Cuong, P. H., Collado-Panduro, L., Latournerie- Moreno, L., Gyawali, S., ... Hue, N. T. N. (2008). A global perspective of the richness and evenness of traditional crop-variety diversity maintained by farming communities. Proceedings of the National Academy of Sciences 105: 5326- 5331. Jiang, G. L. (2013). Molecular markers and marker-assisted breeding in plants. Plant breeding from laboratories to fields, 45-83. Jiao, Y., Zhao, H., Ren, L., Song, W., Zeng, B., Guo, J., ... Zhang, M. (2012). Genome-wide genetic changes during modern breeding of maize. Nature Genetics 44: 812-815. 106 Jorde, L. B. (2001). Population genomics: a bridge from evolutionary history to genetic medicine. Human Molecular Genetics 10: 2199–2207. Kantar, M. B., Nashoba, A. R., Anderson, J. E., Blackman, B. K., Rieseberg, L. H. (2017). The genetics and genomics of plant domestication. BioScience 67: 971-982. Kapusta, A., Kronenberg, Z., Lynch, V. J., Zhuo, X., Ramsay, L., Bourque, G., Yandell, M., Feschotte, C. (2013). Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS genetics 9: e1003470. Kapusta, A., Feschotte, C. (2014). Volatile evolution of long noncoding RNA repertoires: mechanisms and biological implications. Trends in Genetics 30: 439- 452. Kauffman, S. A. (1993). The origins of order: Self-organization and selection in evolution. OUP USA. Keeling, D. M., Garza, P., Nartey, C. M., Carvunis, A. R. (2019). The meanings of 'function' in biology and the problematic case of de novo gene emergence. eLife 8: e47014. Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., ... Dunham, I. (2014). Defining functional DNA elements in the human genome. Proceedings of the National Academy of Sciences 111: 6131-6138. Kimura, M. (1968). Evolutionary rate at the molecular level. Nature 217: 624-626. Koenig, D., Jiménez-Gómez, J. M., Kimura, S., Fulop, D., Chitwood, D. H., Headland, L. R., ... Tohge, T. (2013). Comparative transcriptomics reveals patterns of selection in domesticated and wild tomato. Proceedings of the National Academy of Sciences 110: E2655-E2662. Lam, H. M., Xu, X., Liu, X., Chen, W., Yang, G., Wong, F. L., ... Li, J. (2010). Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nature Genetics 42: 1053-1059. Larson, G., Piperno, D. R., Allaby, R. G., Purugganan, M. D., Andersson, L., Arroyo-Kalin, M., Barton, L., Climer Vigueira, C., Denham, T., Dobney, K., et al. (2014). Current perspectives and the future of domestication studies. Proceedings of the National Academy of Sciences 111: 6139–6146. Lee, D.H., Iwanski, G.B., Thoennissen, N.H. (2010). Cucurbitacin: ancient compound shedding new light on cancer treatment. Scientific World Journal 10: 413-418. Lewontin, R. C. Hubby, J. L. (1966). A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of 107 heterozygosity in natural populations of Drosophila pseudoobscura. Genetics 54: 595-609. Li, Y. H., Zhao, S. C., Ma, J. X., Li, D., Yan, L., Li, J., ... Chang, R. Z. (2013). Molecular footprints of domestication and improvement in soybean revealed by whole genome re-sequencing. BMC Genomics 14: 1-12. Lin, T., Zhu, G., Zhang, J., Xu, X., Yu, Q., Zheng, Z., ... Huang, Z. (2014). Genomic analyses provide insights into the history of tomato breeding. Nature Genetics 46: 1220-1226. Lira, R., Eguiarte, L. E., Montes-Hernández, S. (2009). Proyecto Recopilación y análisis de la información existente de las especies de los géneros Cucurbita y Sechium que crecen y / o se cultivan en México. CONABIO, México, D.F. 107p. Lira, R., Eguiarte, L. E., Montes, S., Zizumbo-Villarreal, D., Colunga- GarcíaMarín, P., Quesada, M. (2016). Homo sapiens-Cucurbita interaction in Mesoamerica: Domestication, Dissemination and Diversification. In: Lira, R., Casas, A., Blancas, J. (eds.). Ethnobotany of Mexico. Springer-Verlag, New York, 389–402. ISBN: 978-1-4614-6669-7 Long, M., VanKuren, N. W., Chen, S., Vibranovski, M. D. (2013). New gene evolution: little did we know. Annual review of genetics 47: 307-333. Lowry, D. B., Hoban, S., Kelley, J. L., Lotterhos, K. E., Reed, L. K., Antolin, M. F., Storfer, A. (2017). Breaking RAD: An evaluation of the utility of restriction site‐ associated DNA sequencing for genome scans of adaptation. Molecular ecology resources 17: 142-152. Lynch, M., Walsh, B. (2007). The origins of genome architecture (Vol. 98). Sunderland, MA: Sinauer Associates. Mallet, J. (1995). A species definition for the modern synthesis. Trends in Ecology & Evolution 10: 294-299. Mattenberger, F., Sabater-Muñoz, B., Toft, C., Fares, M. A. (2017). The phenotypic plasticity of duplicated genes in Saccharomyces cerevisiae and the origin of adaptations. G3: Genes, Genomes, Genetics 7: 63-75. Meyer, R. S., Purugganan, M. D. (2013). Evolution of crop species: genetics of domestication and diversification. Nature Review Genetics 14: 840–52. Milla, R., Osborne, C. P., Turcotte, M. M., Violle, C. (2015). Plant domestication through an ecological lens. Trends in ecology & evolution 30: 463-469. Montero-Pau, J., Blanca, J., Bombarely, A., Ziarsolo, P., Esteras, C., Martí- Gómez, C., Ferriol, M., Gómez, P., Jamilena, M., Mueller, L., et al. (2018). De novo assembly of the zucchini genome reveals a whole genome duplication 108 associated with the origin of the Cucurbita genus. Plant Biotechnol. J. 12: 3218– 3221. Montes-Hernández, S., Merrick, L. C., Eguiarte, L. E. (2005). Maintenance of squash (Cucurbita spp.) landrace diversity by farmers' activities in Mexico. Genetic Resources and Crop Evolution 52: 697-707. Moyers, B. T., Morrell, P. L., McKay, J. K. (2018). Genetic costs of domestication and improvement. Journal of Heredity 109: 103-116. Necsulea, A., Soumillon, M., Warnefors, M., Liechti, A., Daish, T., Zeller, U., Baker, J. C., Grützner, F., Kaessmann, H. (2014). The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505: 635-640. Nee, M. (1990). The domestication of Cucurbita (Cucurbitaceae). Economic Botany 44: 56–68. Nelson, A. D. L., Shippen, D. E. (2015). Evolution of TERT-interacting lncRNAs: expanding the regulatory landscape of telomerase. Frontiers in Genetics 6: 1–6. Nelson, A. D. L., Forsythe, E. S., Devisetty, U. K., Clausen, D. S., Haug-Batzell, A. K., Meldrum, A. M., ... Beilstein, M. A. (2016). A genomic analysis of factors driving lincRNA diversification: lessons from plants. G3: Genes, Genomes, Genetics 6: 2881-2891. Nelson, A. D. L., Devisetty, U. K., Palos, K., Haug-Baltzell, A. K., Lyons, E., Beilstein, M. A. (2017). Evolinc: a tool for the identification and evolutionary comparison of long intergenic non-coding RNAs. Frontiers in Genetics 8: 1–12. Nunes, M. D., Arif, S., Schlötterer, C., McGregor, A. P. (2013). A perspective on micro-evo-devo: progress and potential. Genetics 195: 625-634. Ott, A., Liu, S., Schnable, J. C., Yeh, C. T. E., Wang, K. S., Schnable, P. S. (2017). tGBS® genotyping-by-sequencing enables reliable genotyping of heterozygous loci. Nucleic acids research 45: e178. Paris, H. S. (2016). Genetic Resources of Pumpkins and Squash, Cucurbita spp. Pp. 111–154. In: Grumet, R., Katzir, N., Garcia-Mas, J. (eds.). Genetics and Genomics of Cucurbitaceae. Springer, Gewerbestrasse 11, 6330 Cham, Switzerland. Pigliucci, M. (2010). Genotype–phenotype mapping and the end of the ‘genes as blueprint’ metaphor. Philosophical Transactions of the Royal Society B: Biological Sciences 365: 557-566. Piperno, D. R., Ranere, A. J., Holst, I., Iriarte, J., Dickau, R. (2009). Starch grain and phytolith evidence for early ninth millennium BP maize from the Central Balsas River Valley, Mexico. Proceedings of the National Academy of Sciences 106: 5019-5024. 109 Purugganan, M. D., Fuller, D. Q. (2009). The nature of selection during plant domestication. Nature 457: 843–848. Qi, J., Liu, X., Shen, D., Miao, H., Xie, B., Li, X., ... Du, Y. (2013). A genomic variation map provides insights into the genetic basis of cucumber domestication and diversity. Nature genetics 45: 1510. Qin, C., Yu, C., Shen, Y., Fang, X., Chen, L., Min, J., ... Yang, Y. (2014). Whole- genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization. Proceedings of the National Academy of Sciences 111: 5135-5140. Rendón-Anaya, M., Ibarra-Laclette, E., Méndez-Bravo, A., Lan, T., Zheng, C., Carretero-Paulet, L., …, Farr, K. M. (2019). The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogen-influenced gene space adaptation. Proceedings of the National Academy of Sciences 116: 17081-17089. Reznick, D. N., Ricklefs, R. E. (2009). Darwin's bridge between microevolution and macroevolution. Nature 457: 837-842. Romay, M. C., Millard, M. J., Glaubitz, J. C., Peiffer, J. A., Swarts, K. L., Casstevens, T. M., ... McMullen, M. D. (2013). Comprehensive genotyping of the USA national maize inbred seed bank. Genome biology 14: R55. Ross-Ibarra, J., Morrell, P. L., Gaut, B. S. (2007). Plant domestication, a unique opportunity to identify the genetic basis of adaptation. Proceedings of the National Academy of Sciences 104: 8641-8648. Ruiz-Orera, J., Messeguer, X., Subirana, J. A., Alba, M. M. (2014). Long non- coding RNAs as a source of new peptides. elife, 3: e03523. Sanjur, O. I., Piperno, D. R., Andres, T. C., Wessel-Beaver, L. (2002). Phylogenetic relationships among domesticated and wild species of Cucurbita (Cucurbitaceae) inferred from a mitochondrial gene: Implications for crop plant evolution and areas of origin. Proceedings of the National Academy of Sciences 99: 535-540. Schaefer, H., Heibl, C., Renner, S. S. (2009). Gourds afloat: a dated phylogeny reveals an Asian origin of the gourd family (Cucurbitaceae) and numerous oversea dispersal events. Proceedings of the Royal Society B: Biological Sciences 276: 843-851. Schlötterer, C. (2004). The evolution of molecular markers—just a matter of fashion? Nature reviews genetics 5(1): 63. 110 Schmutz, J., McClean, P. E., Mamidi, S., Wu, G. A., Cannon, S. B., Grimwood, J., ... Torres-Torres, M. (2014). A reference genome for common bean and genome-wide analysis of dual domestications. Nature genetics 46: 707-713. Schreiber, M., Stein, N., Mascher, M. (2018). Genomic approaches for studying crop evolution. Genome biology 19: 140. Shang, Y., Ma, Y., Zhou, Y., Zhang, H., Duan, L., Chen, H., ... Liu, M. (2014). Biosynthesis, regulation, and domestication of bitterness in cucumber. Science 346: 1084-1088. Shearwin, K. E., Callen, B. P., Egan, J. B. (2005). Transcriptional interference–a crash course. Trends in Genetics 21: 339-345. Singh, A.K. (1990). Cytogenetics and evolution in the Cucurbitaceae. In: Bates, D., Robinson, R., Jeffrey, C. (Eds.). Biology and Utilization of the Cucurbitaceae. Cornell University Press, Ithaca, New York. pp. 10-28. Šiško, M., Ivančič, A., Bohanec, B. (2003). Genome size analysis in the genus Cucurbita and its use for determination of interspecific hybrids obtained using the embryo rescue technique. Plant Sci. 165: 663–669. Smith, B. D. (1997). The Initial Domestication of Cucurbita pepo in the Americas 10,000 Years Ago. Science 276: 932–934. Tallamy, D. W., Hibbard, B. E., Clark, T. L., Gillespie, J. J. (2005). Western corn rootworm: ecology and management. In: Vidal, S., Kuhlmann, U., Edwards, C.R. (Eds.). Western corn rootworm: Ecology and management. CAB international, UK. pp. 67-94. Tan, S., Zhong, Y., Hou, H., Yang, S., Tian, D. (2012). Variation of presence/absence genes among Arabidopsis populations. BMC evolutionary biology 12: 86. Theißen, G. (2009). Saltational evolution: hopeful monsters are here to stay. Theory in Biosciences 128: 43–51. Ulitsky, I. (2016). Evolution to the rescue: using comparative genomics to understand long non-coding RNAs. Nature Reviews Genetics 17: 601. Vakirlis, N., Acar, O., Hsu, B., Coelho, N. C., Van Oss, S. B., Wacholder, A., … Parikh, S. B. (2020). De novo emergence of adaptive membrane proteins from thymine-rich genomic sequences. Nature Communications 11: 1-18. Van Oss, S. B., & Carvunis, A. R. (2019). De novo gene birth. PLoS genetics 15: e1008160. Wang, B., Zheng, C., Sankoff, D. (2011). Fractionation statistics. In: BMC bioinformatics (Vol. 12, No. S9, p. S5). BioMed Central. 111 Wang, J., Sun, P., Li, Y., Liu, Y., Yang, N., Yu, J., ... Ge, D. (2018). An overlooked paleotetraploidization in Cucurbitaceae. Molecular Biology and Evolution 35: 16- 26. Webster, G., Goodwin, B. (1984). A structuralist approach to morphology. Rivista di Biologia 77: 503-510. Weeden, N. (1984). Isozyme studies indicate that the genus Cucurbita is an ancient tetraploid. Cucurbit Genet. Coop. Rep. 7: 84-85 Weiling F. (1959). Genomanalytische Untersuchungen bei Kürbis (Cucurbita L.). Der Züchter, 29(4): 161–179. Whitaker, T. W. (1981). Archeological cucurbits. Econ. Bot. 35: 460–466. Whitaker, T. W. (1933). Cytological and Phylogenetic Studies in the Cucurbitaceae. Bot. Gaz. 94: 780–790. Williams, T. A., Nakjang, S., Campbell, S. E., Freeman, M. A., Eydal, M., Moore, K., Hirt, R. P., Embley, T. M., Williams, B. A. (2016). A Recent Whole-Genome Duplication Divides Populations of a Globally Distributed Microsporidian. Molecular biology and evolution 33: 2002–2015. Wong, K. K., deLeeuw, R. J., Dosanjh, N. S., Kimm, L. R., Cheng, Z., Horsman, D. E., MacAulay, C., Ng, R. T., Brown, C. J., Eichler, E. E., Lam, W. L. (2007). A comprehensive analysis of common copy-number variations in the human genome. The American Journal of Human Genetics 80: 91-104. Wright, L. (1973). Function. Philosophical Review 82: 139-168. Wu, S., Shamimuzzaman, M., Sun, H., Salse, J., Sui, X., Wilder, A., Wu, Z., Levi, A., Xu, Y., Ling, K-S., et al. (2017). The bottle gourd genome provides insights into Cucurbitaceae evolution and facilitates mapping of a Papaya ring-spot virus resistance locus. Plant J. 92: 963–975. Xia, X. (2013). What is Comparative Genomics? In: Comparative Genomics (pp. 1-20). Springer, Berlin, Heidelberg. Xie, D., Xu, Y., Wang, J., Liu, W., Zhou, Q., Luo, S., Huang, W., He, X., Li, Q., Peng, Q., Yang, X., Yuan, J., Yu, J., Wang, X., Lucas, W. J., Huang, S., Jiang, B., Zhang, Z. (2019). The wax gourd genomes offer insights into the genetic diversity and ancestral cucurbit karyotype. Nature Communications 10(1): 5158. Xu, X., Liu, X., Ge, S., Jensen, J. D., Hu, F., Li, X., ... Li, J. (2012). Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nature biotechnology 30: 105-111. Zeder, M.A. (2015). Core questions in domestication research. Proc. Natl. Acad. Sci. 112: 3191–3198. 112 Zhang, G., Ren, Y., Sun, H., Guo, S., Zhang, F., Zhang, J., Zhang, H., Jia, Z., Fei, Z., Xu, Y., Li, H. (2015). A high-density genetic map for anchoring genome sequences and identifying QTLs associated with dwarf vine in pumpkin (Cucurbita maxima Duch.). BMC Genomics 16: 1101. BMC Genomics. Zhang, H., Zhang, J., Lang, Z., Botella, J. R., Zhu, J. K. (2017). Genome editing — principles and applications for functional genomics research and crop improvement. Critical Reviews in Plant Sciences 36: 291-309. Zheng, Y. H., Alverson, A. J., Wang, Q. F., Palmer, J. D. (2013). Chloroplast phylogeny of Cucurbita: evolution of the domesticated and wild species. Journal of Systematics and Evolution 51: 326-334. Zhou, Y., Ma, Y., Zeng, J., Duan, L., Xue, X., Wang, H., ... & Zhang, S. (2016). Convergence and divergence of bitterness biosynthesis and regulation in Cucurbitaceae. Nature plants 2: 16183. Zhou, J., Li, D., Wang, G., Wang, F., Kunjal, M., Joldersma, D., Liu, Z. (2019). Application and future perspective of CRISPR/Cas9 genome editing in fruit crops. Journal of integrative plant biology 62: 269-286. Zizumbo-Villarreal, D., Colunga-GarcíaMarín, P. (2010). Origin of agriculture and plant domestication in West Mesoamerica. Genetic Resources and Crop Evolution 57: 813-825. Zizumbo-Villarreal, D., Flores-Silva, A., Marín, P. C. G. (2012). The archaic diet in Mesoamerica: incentive for milpa development and species domestication. Economic Botany 66: 328-343. Zraidi, A., Stift, G., Pachner, M., Shojaeiyan, A., Gong, L., Lelley, T. (2007). A consensus map for Cucurbita pepo. Mol. Breed. 20: 375–388. Zuckerkandl, E., Pauling, L. (1965). Molecules as documents of evolutionary history. Journal of Theoretical Biology 8: 357-366. 113 ANEXO 1: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 1 Supplementary Material 1 Supplementary Figures and Tables TABLE S1 | Some examples of bottom-up tests designed to detect selective sweeps in populations using genomic data and their underlying models (modified from Vitti et al., 2013). See the main text to find the references cited in this table. Underlying model Test name Rationale FST outlier tests BayeScan This method estimates the correlation between allele frequencies (i.e., the genetic structure caused by demographic events) by calculating a multinomial-Dirichlet distribution using a Markov chain Monte Carlo Bayesian approach to subsequently detect outlier loci. This method assumes an island model (i.e., gene flow is equally frequent between each pair of populations) which can lead to false positives when used in populations with limited gene flow and strong population bottlenecks, as expected for many domesticated taxa (Foll and Gaggiotti, 2008). Excoffier, Hofer and Foll (EHF) test Uses coalescent simulations to calculate the distribution of genetic diversity within and between populations to obtain a null distribution of FST values and detect outliers. This method is optimized for hierarchically structured populations (i.e., genetic flow is more frequent within populations and between populations belonging to the same group such as domesticated or wild taxa) by using a hierarchical island model in which populations are assigned to groups a priori (Excoffier et al., 2009). TF-LK statistic This method calculates a kinship matrix between populations by generating a neighbor-joining population tree in order to obtain a null distribution to detect FST outliers. The statistic seems to be robust to complex demographic scenarios, reducing the emergence of false candidate loci (Bonhomme et al., 2010). BayeScEnv Uses the same approximation as BayeScan to detect FST outliers, but it also incorporates environmental cues to detect adaptation to local environment (de Villemereuil and Gaggiotti, 2015). PCAdapt This method calculates the underlying population structure using a principal component analysis and then detects candidate loci under selection using Mahalanobis distance. This method is particularly powerful when handling admixed individuals and hierarchical structure in the studied populations (Luu et al., 2017). 114 SFS based methods Tajima's D statistic Calculated from the comparison of the nucleotide diversity (π) against Watterson's Theta (θW). When π < θW, the region has an excess of low-frequency variants, suggesting purifying or positive selection, whereas when π > θW, the region has an excess of middle-frequency variants, suggesting balancing selection or a soft selective sweep (Tajima, 1989). Fai and Wu's H statistic Calculated from the comparison of π against θH or θL, which are estimators of theta weighted by the homozygosity of derived variants. When π < θH, the region has an excess of high- frequency variants, a characteristic signature of selective sweeps (Fay and Wu, 2000). Zeng et al.'s E statistic Calculated from the comparison of θL against θW, rendering it sensible to changes in high and low-frequency variants, which are signals of selective sweeps before and after the fixation of the locus under selection (Zeng et al., 2006). Reduction of diversity (ROD) test The selective sweeps associated to domestication will form a pattern where the domesticated taxon will have a significantly lower diversity in that region compared to the overall diversity in its genome, while the wild taxon will not show reduced genetic diversity in that locus (πwild > πdomesticate) (Guo et al., 2012). LD based methods Long-range haplotype (LRH) test This test uses the EHH statistic to detect whether an haplotype is inherited throughout the population without its disruption by recombination, suggesting that such haplotype is under positive selection (Sabeti et al., 2002). Whole-genome long-range haplotype (WGLRH) test The WGLRH test performs the LRH test throughout the entire genome using sliding windows to detect EHH outliers (Zhang et al., 2006). Long-range haplotype similarity (LRHs) test Calculates the similarity between homologous haplotypes by calculating an haplosimilarity score throughout the genome using sliding windows in order to detect haplotypes that contain alleles with low frequencies that are similar between each other, suggesting large haplotypes under a recent selective pressure that hasn’t been disrupted by recombination (Hanchard et al., 2006). Integrated haplotype score (iHS) The iHS compares the area under the curve defined by the EHH, which allows the identification of incomplete selective sweeps and soft sweeps throughout the genome (Voight et al., 2006). Cross- population extended haplotype homozygosity (XP-EHH) statistic This test compares the EHH in a locus between a population with a fixed haplotype against other populations where such locus remains polymorphic (e.g., domesticated and wild populations), allowing it to detect selective sweeps after the selected allele reached fixation (Sabeti et al., 2007). 115 LD decay (LDD) test The LDD test sorts individuals according to their homozygosity for each of the alleles found in any given SNP in the genome, and then calculates the fraction of heterozygous SNPs that are adjacent to each of the allelic variants in the SNP that is being evaluated. This way the LDD test can determine whether or not the LD decays significantly slower in one of the alleles of the evaluated SNP, suggesting a recent selective sweep. Since the test calculates the decay in LD only for the individuals that are homozygous in the SNP being evaluated, there is no necessity to obtain phased haplotypes (Wang et al., 2006). Regression- based test Calculates the reduction of heterozygosity as one approaches the locus under selection in a genome to infer selective sweeps (Wiener and Pong-Wong, 2011). OmegaPlus Implements the ω statistic by detecting regions with high SNP correlation coefficient across the genome to find regions under selection (Alachiotis et al., 2012). GIBDLD Calculates the identity-by-descent (IBD) between all pairs of individuals for each locus in the dataset to detect segments of IBD (i.e., haplotypes) that are shared between several unrelated pairs of individuals, suggesting the action of selective pressures (Han and Abney, 2013). Composite tests Cross- population composite likelihood ratio (XP-CLR) test The XP-CLR test searches for genomic regions with extended allele differentiation between a population under selective pressures (i.e., domesticated taxa) and a "control" population (i.e., a close wild relative) in order to detect the haplotypes where differentiation happened quicker than expected under neutrality (Chen et al., 2010). RAiSD This program uses the µ statistic to score genomic regions as candidate loci based on the joint analysis of LD, changes in the SFS and a reduction in the genetic diversity within sliding windows (Alachiotis and Pavlidis, 2018). 116 ANEXO 2: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 2 Supplemental information The genome of Cucurbita argyrosperma (silver-seed gourd) reveals faster rates of protein-coding gene and long noncoding RNA turnover and neofunctionalization within Cucurbita Content Supplemental Methods ……………………………………………………………...117 Supplemental Figure 1-7 ……………………………………………………………123 Supplemental Table 1-4 ...…………………………………………………………...130 Supplemental References …………………………………………………………..134 117 Supplemental Methods Library construction and DNA/RNA sequencing Illumina library construction and sequencing were done at the Vincent J. Coates Genomics Sequencing Laboratory in UC Berkeley (NIH S10 Instrumentation Grants S10RR029668 and S10RR027303). C. argyrosperma genomic DNA was sheared using a Bioruptor Pico sonication device (Diagenode). Illumina libraries of 350 bp and 1,000 bp were constructed and sequenced on HiSeq2000 and MiSeq systems, respectively, using paired-end mode. PacBio RS II (P6/C4 chemistry) sequencing was done at the University of Washington PacBio Sequencing Services. Libraries were prepared and size- selected with fragments >8 kbp using the BluePippin system (Sage Science). Sequencing was done on 8 SMRT cells, using MagBeads to prioritize the sequencing of longer DNA fragments. For the C. argyrosperma transcriptome, Illumina ribo-depleted libraries (350 bp) were prepared for each RNA sample after cDNA transformation and shearing. All libraries were multiplexed and sequenced in a single HiSeq2000 lane using paired-end mode. Sequence read quality filters Raw Illumina reads were quality-filtered using the qualityControl.py script (https://github.com/Czh3/NGSTools), retaining only those read pairs that showed a PHRED quality ≥ 30 in at least 85% of its sequence and an average PHRED quality ≥ 25. We used SeqPrep (https://github.com/jstjohn/SeqPrep) to trim Illumina adapter 118 sequences and merge read pairs that showed at least 20 bp of overlap. PacBio subreads were obtained after filtering reads with a minimum length of 50 bp and a minimum quality score of 75. Organelle and nuclear genome assemblies The chloroplast genome of C. argyrosperma was assembled using the Illumina quality trimmed reads with the NOVOPlasty assembler (Dierckxsens et al., 2016). We assembled the chloroplast genome in two contigs using the chloro2 algorithm, a k-mer size of 39 and a chloroplast genome previously reported by Kistler et al. (2015) as a sequence seed (GenBank accession number KT898803.1). The chloroplast contigs were merged in a single scaffold using the PacBio reads and the scaffolding software SSPACE-longread (Boetzer and Pirovano, 2014). A final gap-filling and base correction step was done with Pilon (Walker et al., 2014) by mapping the Illumina quality trimmed reads using BWA mem (Li, 2013) to the chloroplast scaffold. We filtered the reads belonging to the organelle genomes before nuclear genome assembly. Illumina quality trimmed reads were filtered by mapping them to the assembled chloroplast genome and the previously assembled C. pepo mitochondrial genome (Alverson et al. 2010, GenBank accession number GQ856148.1) using Hisat2 (Kim et al., 2015). PacBio data was filtered by mapping the reads to the organelle genomes using BlasR (Chaisson and Tesler, 2012). The PacBio reads that mapped to the C. pepo mitochondrion were used to assemble the C. argyrosperma mitochondrion genome using the Organelle-PBA pipeline (Soorni et al., 2017). The mitochondrial contigs were merged into scaffolds using the mitochondrial PacBio reads and the SSPACE-longread script. A final gap- 119 filling and base correction step was done to the mitochondrial scaffolds using the Illumina quality trimmed reads by doing 5 iterations of Pilon and BWA mem. The Illumina nuclear reads were assembled into contigs with the Platanus assembler (Kajitani et al., 2014), using an initial k-mer size of 32, a step size of 10 for k-mer extension and a bubble crush parameter of 0.1. The assembled contigs were used alongside the PacBio nuclear reads to construct an assembly backbone according to the DBG2OLC pipeline (Ye et al., 2016), considering a k-mer size of 17, a k-mer matching threshold of 5, an adaptive k-mer threshold of 0.01 and a minimum matching k-mer number of 30 between two PacBio reads. We used Minimap and Racon (Vaser et al., 2017) twice to map the Platanus contigs and the nuclear PacBio reads to the assembly backbone and obtain a consensus sequence assembly. We did three iterations of base correction by mapping the Illumina nuclear reads to the consensus sequence assembly using BWA mem and Pilon. Scaffolding was done with the paired-end information of the Illumina reads using BESST (Sahlin et al., 2014) and with the PacBio reads using SSPACE-longread. We did 30 iterations of gap closing with GapFiller (Boetzer and Pirovano, 2012), using Bowtie (Langmead et al., 2009) to map the Illumina reads to the assembled genome. Three final base corrections were done with BWA mem and Pilon. De novo and genome-guided transcriptome assemblies The transcriptome was assembled de novo using Trinity (Grabherr et al., 2011). DeconSeq (Schmieder and Edwards, 2011) was used to remove contigs with at least 90% identity to a database of contaminant sequences (bacteria, virus, fungi and 120 arthropods). Poly A tails were trimmed from the remaining contigs with SeqClean (Tae et al., 2012). The remaining contigs were reassembled using CAP3 (Huang and Madan, 1999) to merge fragmented transcripts. The transcriptome reads were also used to generate a genome-guided transcriptome assembly. We mapped the transcriptome reads of each organ to the genome assembly using Hisat2 (Kim et al., 2015). The mapped reads of each organ were assembled into transcripts using StringTie (Pertea et al., 2015) using a minimum coverage of 1, a minimum intron length of 20 bp and a maximum intron length of 500,000 bp. All transcripts were merged into a single genome-guided transcriptome assembly using Cuffmerge (Trapnell et al., 2012). Prediction of protein-coding gene models We searched for ORFs within the de novo assembled transcriptome to identify coding transcripts using AlignWise (Evans and Loose, 2015). The proteins of the predicted ORFs were compared to the proteomes of Citrullus lanatus (Guo et al., 2012), Cucumis sativus (Huang et al., 2009) and Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000) with BLASTp (Camacho et al., 2009). We considered as complete or nearly complete proteins those that had an alignment coverage of at least 70% compared to its best BLAST hit (6,987 proteins) and used them to train AUGUSTUS (Stanke et al., 2006) and obtain a hidden Markov model (HMM) file for gene prediction. MAKER3 (Cantarel et al., 2008) was used to obtain protein-coding gene models in the C. argyrosperma genome assembly. We used the repeat library obtained from REPET to mask TEs with RepeatMasker, as well as RepeatRunner 121 (Smith et al., 2007) to mask transposable element proteins. GeneMark-ES (Lomsadze et al., 2005) was used to obtain a self-trained HMM file from the masked genome. We used AUGUSTUS and GeneMark-ES HMM files for ab initio gene prediction. Both the de novo and the genome-guided transcriptome assemblies were used as transcriptomic evidence. The SwissProt database (downloaded on July 13, 2017) was used as protein homology evidence. We used EvidenceModeler (Haas et al., 2008) within MAKER3 to obtain additional gene models from the weighted evidence of ab initio predictions, transcript evidence and protein homology evidence. The gene models obtained from MAKER3 were used to train SNAP (Korf, 2004) and obtain a HMM file for the second round of gene model prediction. We used MAKER3 for a second round of gene model prediction, incorporating the SNAP HMM training file, the previous MAKER3 models as an annotation pass- through and using the proteomes of Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000), Cucumis sativus (Huang et al., 2009), Cucumis melo (Garcia-Mas et al., 2012), Theobroma cacao (Argout et al., 2011), Vitis vinifera (Jaillon et al., 2007), Fragaria vesca (Edger et al., 2018) and Citrus sinsensis (Xu et al., 2012) as additional protein homology evidence. We incorporated tRNAscan-SE (Lowe and Eddy, 1997) within the MAKER3 pipeline to predict tRNA genes. Phylogenetic analyses We detected single-copy ortholog genes that were conserved in all the analyzed species, that is, gene families where every species had just one representative gene. The aligned sequences of the single-copy orthologs were concatenated and used to obtain a species phylogeny with PhyML (Guindon et al., 2010). We used the Akaike 122 Information Criterion within SMS (Lefort et al., 2017) to determine the best amino acid substitution model for our sequence alignment: JTT matrix (Jones et al., 1992) with a gamma shape parameter (G = 0.906) of four categories and a proportion of invariant sites (I = 0.081). We used BioNJ for the initial tree with an SPR search for tree topology and performed aLRT to test branch support at each node (Ansimova and Gascuel, 2006). To obtain a dated phylogeny, we used a Bayesian Markov Chain Monte Carlo (MCMC) approach with approximate likelihood calculation, as implemented in mcmctree (Yang, 2007). We incorporated the ages of Protofagacea (Herendeen et al., 1995) and Antiquacupula (Sims et al., 1998) as a minimum date of divergence for the root of the tree (84 Mya), since they can be unambiguously assigned to the Fagales Order, thereby functioning as a minimum date of divergence between Cucurbitales and Fagales (Wikstrom et al., 2001). We also used the estimated age of the origin of core eudicots (115 Mya) as a maximum time of divergence for the root of the tree (Chang et al., 2004). We ran two independent MCMC analyses for 11,000,000 generations, sampling every 500 generations and specifying an initial burn-in of 1,000,000 generations. We confirmed convergence of the two chains using Tracer v1.6 (Rambaut et al., 2014). Both mcmctree trace files are available in Supplemental Data 3. 123 Supplemental Figures Supplemental Figure 1. Species-exclusive protein-coding gene families from all the analyzed taxa. The species-exclusive gene families were discarded from the gene family expansion/contraction analyses. The remaining 11,961 gene families (grey circle) were shared between at least two species and were used to calculate gene birth-death rates. 124 Supplemental Figure 2. Likelihood ratio test between a single gene birth-death rate (lambda) throughout the tree and a change in lambda within the Cucurbita genus. After calculating the observed likelihood ratio between a single lambda and two lambdas [2*((-198328.178461)-(-193746.504245))], 100 simulations were made to obtain a null distribution and asses if the observed likelihood ratio could be explained by chance. None of the simulated likelihood ratios were lower than the observed likelihood ratio (-9163.35), yielding a significantly low p-value (p < 0.01). 125 Supplemental Figure 3. Likelihood ratio test between a single gene birth-death rate (lambda) throughout the tree and a change in lambda within the Cucurbita genus using high-confidence protein-coding gene models. After calculating the observed likelihood ratio between a single lambda and two lambdas [2*((- 169502.088320)-(-165841.652871))], 100 simulations were made to obtain a null distribution and asses if the observed likelihood ratio could be explained by chance. None of the simulated likelihood ratios were lower than the observed likelihood ratio (-7320.87), yielding a significantly low p-value (p < 0.01). 126 Supplemental Figure 4. Dated phylogeny of the Cucurbitaceae family with protein-coding gene family expansions and contractions per branch after discarding low-quality protein-coding gene models. Fossil evidence was used to calibrate the basal node of the tree (green hexagon). The pie charts and the percentages at every branch of the tree indicate whether a gene family expanded (red), contracted (blue) or remained the same size (gray). The yellow stars indicate the estimated ages of whole-genome duplication events within the phylogeny. The black arrows indicate the change in the basal gene birth/death rate of the phylogeny after the whole-genome duplication in Cucurbita. 127 Supplemental Figure 5. Pattern of lincRNA conservation in the outgroup species, Juglans regia and Fragaria vesca. Only a few lincRNA loci were conserved between Cucurbita argyrosperma and the outgroup species. All of these lincRNAs show complex evolutionary patterns and none of them are conserved in all the analyzed species. 128 Supplemental Figure 6. Duplication of lincRNA family associated to the whole- genome duplication in Cucurbita (red diamond). The lincRNA family was predicted using C. lanatus “Clan_TCONS_00047735” gene as query with Evolinc-II (Nelson et al., 2017). Some lincRNAs within the family were independently predicted by homology using Evolinc-II (triangles in terminal nodes) and with transcriptomic evidence using Evolinc-I (circles in terminal nodes), further supporting the transcriptional activity of such genes. The colors at the terminal nodes indicate which gene belongs to each species. 129 Supplemental Figure 7. Secondary structure of the protein-coding transcript Carg19464 (trafficking protein particle complex subunit 11). A) Minimum free energy (MFE) structural prediction. B) RNA base-pairing probability matrix showing the MFE structural prediction (below the diagonal) and all the possible suboptimal pairings (above the diagonal). Higher probabilities are represented as larger dots within the matrix. 130 Supplemental Tables Supplemental Table 1. Transposable elements within the genome of Cucurbita argyrosperma, classified according to Wicker’s classification system (Wicker et al., 2007). Class Subclass Order Number of elements Number of bp percentage (genome assembly) DNA transposons Subclass-I (autonomous) TIR 787 856472 0.3743 Crypton 405 205905 0.0899 Subclass-II (autonomous) Helitron 326 336888 0.1472 Maverick 65 107728 0.0471 Non-autonomous DNA transposons MITE 32 17023 0.0074 Retrotransposons Autonomous retrotransposons LINE 1507 974219 0.4257 LTR 32222 38316833 16.7458 SINE 34 7161 0.0031 DIRS 5683 7746666 3.3855 Non-autonomous retrotransposons LARD 42230 22917378 10.0157 TRIM 3830 2868582 1.2536 Unclassifiable retrotransposon NA 273 231598 0.1012 Unclassifiable transposon NA NA 248 106628 0.0466 Unknown repetitive element NA NA 2572 3364279 1.4703 TOTAL NA NA 90214 78057360 34.1138 131 Supplemental Table 2. Assembly size and gene content of all the analyzed genomes. Group Species Assembly size Protein-coding genes High-quality gene models* Predicted lincRNAs Cucurbita spp. Cucurbita argyrosperma 229 Mbp 28,298 26,603 6,124 Cucurbita moschata 270 Mbp 32,205 26,425 998 Cucurbita maxima 271 Mbp 32,076 26,843 1,054 Cucurbita pepo 263 Mbp 27,867 27,017 3,272 Other cucurbits (Cucurbitac eae) Cucumis sativus 243 Mbp 26,682 NA 3,488 Cucumis melo 375 Mbp 27,427 NA 153 Citrullus lanatus 353 Mbp 23,440 21,244 5,549 Lagenaria siceraria 313 Mbp 22,472 18,903 2,446 Momordica charantia 285 Mbp 23,514 NA 1,062 Outgroups Juglans regia 667 Mbp 32,496 30,994 NA Fragaria vesca 220 Mbp 28,588 23,700 NA * Gene predictions with Annotation Edit Distances lower than 0.5. 132 Supplemental Table 3. SRA accessions and sequenced organs of the RNAseq data used to predict lincRNA genes for each species using Evolinc-I (Nelson et al., 2017). Species leaves roots flowers stems fruits tendrils multiple pooled organs Cucurbita argyrosperma SRR7685407 SRR7685405 SRR7685400 SRR7685406 NA SRR7685404 NA Cucurbita maxima SRR5504358 SRR5504359 SRR5504360 SRR5504353 SRR5504354 SRR5504355 NA SRR5504356 SRR5504357 SRR5504361 SRR5504362 SRR5504363 NA NA Cucurbita moschata SRR5504348 SRR5504349 SRR5504344 SRR5504343 SRR5504342 NA SRR5504345 SRR5504346 SRR5504347 SRR5504350 SRR5504351 SRR5504352 NA NA Cucurbita pepo NA NA NA NA NA NA SRR091276 SRR091277 SRR1182476 SRR1182477 Cucumis melo NA NA NA NA NA NA SRR411100 SRR411102 SRR411103 SRR411104 SRR411105 SRR411106 Cucumis sativus SRR351906 SRR351499 SRR351908 SRR351912 SRR351905 SRR351476 SRR351489 SRR351495 SRR351910 SRR351911 NA Citrullus lanatus SRR3156549 SRR3706796 SRR2033940 SRR2033941 SRR494474 SRR494479 SRR4046409 SRR4046416 SRR4046425 NA NA Lagenaria siceraria SRR5590272 SRR5590273 SRR5590270 SRR5590274 SRR5590271 NA NA 133 Supplemental Table 4. Dinucleotide content within a lincRNA and its protein- coding homolog. lincRNA (Carg_TCONS_00015392) protein-coding homolog (Carg19464) Dinucleotide Count Frequency Count Frequency AA 16 0.0740741 365 0.0856405 AC 12 0.0555556 204 0.0478649 AG 13 0.0601852 300 0.0703895 AT 14 0.0648148 331 0.0776631 CA 12 0.0555556 290 0.0680432 CC 6 0.0277778 226 0.0530267 CG 8 0.037037 86 0.0201783 CT 15 0.0694444 316 0.0741436 GA 17 0.0787037 323 0.075786 GC 8 0.037037 187 0.0438761 GG 8 0.037037 180 0.0422337 GT 18 0.0833333 208 0.0488034 TA 10 0.0462963 222 0.0520882 TC 14 0.0648148 301 0.0706241 TG 23 0.1064815 332 0.0778977 TT 22 0.1018519 391 0.091741 134 Supplemental References Alverson, A. J., Wei, X., Rice, D. W., Stern, D. B., Barry, K., and Palmer, J. D. (2010). Insights into the Evolution of Mitochondrial Genome Size from Complete Sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol. Biol. Evol. 27:1436–1448. Ansimova, M., and Gascuel, O. (2006). Approximate Likelihood-Ratio Test for branches: a fast, accurate, and powerful alternative. Syst. Biol. 55:539–552. Argout, X., Salse, J., Aury, J.-M., Guiltinan, M. J., Droc, G., Gouzy, J., Allegre, M., Chaparro, C., Legavre, T., Maximova, S. N., et al. (2011). The genome of Theobroma cacao. Nat. Genet. 43:101–8. Boetzer, M., and Pirovano, W. (2012). Toward almost closed genomes with GapFiller. Genome Biol. 13:R56. Boetzer, M., and Pirovano, W. (2014). SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics 15:211. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T. L. (2009). BLAST plus : architecture and applications. BMC Bioinformatics 10. Cantarel, B. L., Korf, I., Robb, S. M. C., Parra, G., Ross, E., Moore, B., Holt, C., Alvarado, A. S., and Yandell, M. (2008). MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18:188–196. Chaisson, M. J., and Tesler, G. (2012). Mapping single molecule sequencing reads 135 using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13:238. Chang, C.-C., Chen, H.-L., Li, W.-H., and Chaw, S.-M. (2004). Dating the Monocot- Dicot Divergence and the Origin of Core Eudicots Using Whole Chloroplast Genomes. J. Mol. Evol. 58:424–441. Dierckxsens, N., Mardulyn, P., and Smits, G. (2016). NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45:gkw955. Edger, P. P., VanBuren, R., Colle, M., Poorten, T. J., Wai, C. M., Niederhuth, C. E., Alger, E. I., Ou, S., Acharya, C. B., Wang, J., et al. (2018). Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. Gigascience 7:1–7. Evans, T., and Loose, M. (2015). AlignWise: a tool for identifying protein-coding sequence and correcting frame-shifts. BMC Bioinformatics 16:376. Garcia-Mas, J., Benjak, A., Sanseverino, W., Bourgeois, M., Mir, G., Gonzalez, V. M., Henaff, E., Camara, F., Cozzuto, L., Lowy, E., et al. (2012). The genome of melon (Cucumis melo L.). Proc. Natl. Acad. Sci. 109:11872–11877. Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al. (2011). Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29:644–652. Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O. (2010). New algorithms and methods to estimate Maximum- 136 Likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307-321. Guo, S., Zhang, J., Sun, H., Salse, J., Lucas, W. J., Zhang, H., Zheng, Y., Mao, L., Ren, Y., Wang, Z., et al. (2012). The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat. Genet. 45:51–58. Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., White, O., Buell, C. R., and Wortman, J. R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9:R7. Herendeen, P. S., Crane, Peter, R., and Drinnan, A. N. (1995). Fagaceous flowers, fruits, and cupules from the Campanian (late Cretaceous) of central Georgia, USA. Int. J. Plant Sci. 156:93–116. Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W. J., Wang, X., Xie, B., Ni, P., et al. (2009). The genome of the cucumber, Cucumis sativus L. Nat. Genet. 41:1275–1281. Huang, X., and Madan, A. (1999). CAP3: A DNA Sequence Assembly Program. Genome Res. 9:868–877. Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., et al. (2007). The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–7. Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences 8:275-282. 137 Kajitani, R., Toshimoto, K., Noguchi, H., Toyoda, A., Ogura, Y., Okuno, M., Yabana, M., Harada, M., Nagayasu, E., Maruyama, H., et al. (2014). Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24:1384–1395. Kim, D., Langmead, B., and Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12:357–360. Kistler, L., Newsom, L. A., Ryan, T. M., Clarke, A. C., Smith, B. D., and Perry, G. H. (2015). Gourds and squashes (Cucurbita spp.) adapted to megafaunal extinction and ecological anachronism through domestication. Proc. Natl. Acad. Sci. 112:15107–15112. Korf, I. (2004). Gene finding in novel genomes. BMC Bioinformatics 5:59. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25. Lefort, V., Longueville, J-E., and Gascuel, O. (2017). SMS: Smart Model Selection in PhyML. Molecular Biology and Evolution, 34:2422-2424. Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv Prepr. arXiv 00:3. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O., and Borodovsky, M. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33:6494–6506. Lowe, T. M., and Eddy, S. R. (1997). tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955–964. 138 Nelson, A. D. L., Devisetty, U. K., Palos, K., Haug-Baltzell, A. K., Lyons, E., and Beilstein, M. A. (2017). Evolinc: A Tool for the Identification and Evolutionary Comparison of Long Intergenic Non-coding RNAs. Front. Genet. 8:1–12. Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., and Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33:290–295. Rambaut, A., Suchard, M., Xie, D., and Drummond, A. (2014). Tracer 1.6, Available at http://tree.bio.ed.ac.uk/software/tracer/. Sahlin, K., Vezzi, F., Nystedt, B., Lundeberg, J., and Arvestad, L. (2014). BESST - Efficient scaffolding of large fragmented assemblies. BMC Bioinformatics 15:281. Schmieder, R., and Edwards, R. (2011). Fast Identification and Removal of Sequence Contamination from Genomic and Metagenomic Datasets. PLoS One 6:e17288. Sims, H. J., Herendeen, P. S., and Crane, Peter, R. (1998). New genus of fossil Fagaceae from the Santonian (Late Cretaceous) of central Georgia, U.S.A. Int. J. Plant Sci. 159:391–404. Smith, C. D., Edgar, R. C., Yandell, M. D., Smith, D. R., Celniker, S. E., Myers, E. W., and Karpen, G. H. (2007). Improved repeat identification and masking in Dipterans. Gene 389:1–9. Soorni, A., Haak, D., Zaitlin, D., and Bombarely, A. (2017). Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data. BMC Genomics 18:49. Stanke, M., Schöffmann, O., Morgenstern, B., and Waack, S. (2006). Gene 139 prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62. Tae, H., Ryu, D., Sureshchandra, S., and Choi, J.-H. (2012). ESTclean: a cleaning tool for next-gen transcriptome shotgun sequencing. BMC Bioinformatics 13:247. The Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7:562–578. Vaser, R., Sovic, I., Nagarajan, N., and Sikic, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. Advance Access published January 18, 2017, doi:10.1101/gr.214270.116. Walker, B. J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., Cuomo, C. A., Zeng, Q., Wortman, J., Young, S. K., et al. (2014). Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS One 9:e112963. Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J. L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., et al. (2007). A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8:973–982. Wikstrom, N., Savolainen, V., and Chase, M. W. (2001). Evolution of the angiosperms: calibrating the family tree. Proc. R. Soc. B Biol. Sci. 268:2211– 140 2220. Xu, Q., Chen, L.-L., Ruan, X., Chen, D., Zhu, A., Chen, C., Bertrand, D., Jiao, W.- B., Hao, B.-H., Lyon, M. P., et al. (2012). The draft genome of sweet orange (Citrus sinensis). Nat. Genet. 45:59–66. Yang, Z. (2007). PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol. 24:1586–1591. Ye, C., Hill, C. M., Wu, S., Ruan, J., and Ma, Z. (Sam) (2016). DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci. Rep. 6:31900. 141 ANEXO 3: MATERIAL SUPLEMENTARIO DEL CAPÍTULO 3 Supplementary information for: The domestication patterns of Cucurbita argyrosperma are consistent with early migration events in Mesoamerica and general domestication syndromes in squashes. Josué Barrera-Redondo, Guillermo Sánchez-de la Vega, Jonás A. Aguirre-Liguori, Gabriela Castellanos-Morales, Xitlali Aguirre-Dugua, Erika Aguirre-Planter, Yocelyn T. Gutiérrez- Guerrero, Maud Tenaillon, Salvador Montes-Hernández, Rafael Lira-Saade, Luis E. Eguiarte Contact: fruns@unam.mx and rlira@unam.mx This PDF file includes: Supplemental Tables S1: Assembly metrics of C. argyrosperma subsp. sororia genome. S2: Information of 192 individuals sequenced with tGBS. S3: Genetic diversity of each C. argyrosperma population. S4: Strong candidate SNPs. S5: Candidate genes under selection. S6: GO enrichment results S7: Cucurbitacin-related genes in C. argyrosperma. Supplemental Figures S1: Gene synteny dot plots between Cucurbita genomes. S2-S21: Nucleotide synteny dot plots for each chromosome. S22: LD decay in C. argyrosperma. S23: Synteny dot plot between proteins and pseudogenes. 142 Supplemental Tables Table S1. Assembly metrics of the reference genome of Cucurbita argyrosperma subsp. sororia. Assembly size 253,587,946 bp No. of contigs 828 Longest contig 4,976,248 bp N50 1,323,288 bp L50 58 contigs No. of contigs > 1 kbp 828 (100.0%) No. of contigs > 10 kbp 810 (97.8%) No. of contigs > 100 kbp 292 (35.3%) CG content 36.56% Illumina read coverage 213x PacBio read coverage 75.4x No. of genes 31,452 Average gene size 3,269 bp Complete BUSCOs 92.80% Fragmented BUSCOs 1.20% Missing BUSCOs 6.00% 143 Table S2. Information of 192 individuals sequenced using tGBS libraries, including population name, geographical coordinates and SRA accession. (within population names: W = wild, D = domesticated) Individual ID Population name Population number Taxon Latitude Longitude SRA accession 129 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX 131 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX M1_145 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX M2_146 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX M3_147 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX M4_148 Alamos, Sonora (Outgroup) 0 Cucurbita moschata (Outgroup) 27.02694 -108.93659 XXXXXX S_CHIS14 Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHIS18 Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHIS19 Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHIS22 Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHIS8 Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHISA Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHISB Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_CHISC Jiquipilas, Chiapas (W) 1 Cucurbita argyrosperma subsp. sororia 16.597486 -93.626878 XXXXXX S_GRO40 Ometepec, Guerrero (W) 2 Cucurbita argyrosperma subsp. sororia 16.688139 -98.406319 XXXXXX S_GRO43 Ometepec, Guerrero (W) 2 Cucurbita argyrosperma subsp. sororia 16.688139 -98.406319 XXXXXX S_GRO46 Ometepec, Guerrero (W) 2 Cucurbita argyrosperma subsp. sororia 16.688139 -98.406319 XXXXXX S_GRO49 Ometepec, Guerrero (W) 2 Cucurbita argyrosperma subsp. sororia 16.688139 -98.406319 XXXXXX S_GRO51 Ometepec, Guerrero (W) 2 Cucurbita argyrosperma subsp. sororia 16.688139 -98.406319 XXXXXX S_OAX10 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX11 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX1 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX2 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX 144 S_OAX3 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX4 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX5 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX7 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX8 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_OAX9 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.918225 -97.076308 XXXXXX S_MAC2 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.9528333 -97.0772778 XXXXXX S_MAC3 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.9528333 -97.0772778 XXXXXX S_MAC4 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.9528333 -97.0772778 XXXXXX S_MAC5 Puerto Escondido, Oaxaca (W) 3 Cucurbita argyrosperma subsp. sororia 15.9528333 -97.0772778 XXXXXX 22 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.682 -104.333278 XXXXXX 23 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.682 -104.333278 XXXXXX 90 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.7019583 -104.204314 XXXXXX 53 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9005833 -104.160222 XXXXXX 54 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9005833 -104.160222 XXXXXX 65 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9005833 -104.160222 XXXXXX 40 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9005833 -104.160222 XXXXXX 26 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6997222 -104.203056 XXXXXX 27 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6997222 -104.203056 XXXXXX 28 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6997222 -104.203056 XXXXXX 88 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.8753889 -104.072333 XXXXXX 45 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9123056 -104.116833 XXXXXX 49 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9123056 -104.116833 XXXXXX 145 51 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9123056 -104.116833 XXXXXX 55 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9123056 -104.116833 XXXXXX 59 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9600833 -104.037472 XXXXXX 44 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.9574722 -103.988389 XXXXXX 87 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.8753889 -104.072333 XXXXXX 85 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.8356111 -104.081417 XXXXXX 86 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.8356111 -104.081417 XXXXXX 80 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6909722 -104.360833 XXXXXX 83 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6909722 -104.360833 XXXXXX 84 Jalisco (W) 4 Cucurbita argyrosperma subsp. sororia 19.6909722 -104.360833 XXXXXX 97 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 98 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 99 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 100 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 101 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 102 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 103 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 104 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 105 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 106 Tlapehuala, Guerrero (D) 5 Cucurbita argyrosperma subsp. argyrosperma 18.2416667 -100.534722 XXXXXX 77 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.7019583 -104.204314 XXXXXX 79 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.7019583 -104.204314 XXXXXX 32 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.6997222 -104.203056 XXXXXX 34 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.6997222 -104.203056 XXXXXX 15 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8753889 -104.072333 XXXXXX 17 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8753889 -104.072333 XXXXXX 56 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.9600833 -104.037472 XXXXXX 70 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8714722 -104.217333 XXXXXX 73 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8714722 -104.217333 XXXXXX 39 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.9574722 -103.988389 XXXXXX 146 1 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 2 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 3 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 4 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 6 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 7 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 19 Jalisco (D) 6 Cucurbita argyrosperma subsp. argyrosperma 19.8356111 -104.081417 XXXXXX 144 Badiraguato, Sinaloa (D) 7 Cucurbita argyrosperma subsp. argyrosperma 25.359173 -107.558408 XXXXXX MTP14 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP1 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP2 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP3 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP4 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP5 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX MTP9 Matlalapa, Guerrero (D) 8 Cucurbita argyrosperma subsp. argyrosperma 17.5971639 -99.4579917 XXXXXX 92 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH14 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH1 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH2 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH3 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH4_B01 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH4_D07 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH5 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH6 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX SAH7 Sahuayo, Michoacán (D) 9 Cucurbita argyrosperma subsp. argyrosperma 20.0587194 -102.716233 XXXXXX 91 Salamanca, Guanajuato (D) 10 Cucurbita argyrosperma subsp. argyrosperma 20.5205778 -101.190992 XXXXXX SJI4_2 San José Iturbide, Guanajuato (D) 10 Cucurbita argyrosperma subsp. argyrosperma 20.9988889 -100.385 XXXXXX NAY2 Tepic, Nayarit (D) 11 Cucurbita argyrosperma subsp. argyrosperma 21.519956 -104.893423 XXXXXX NAY3 Tepic, Nayarit (D) 11 Cucurbita argyrosperma subsp. argyrosperma 21.519956 -104.893423 XXXXXX 147 NAY4 Tepic, Nayarit (D) 11 Cucurbita argyrosperma subsp. argyrosperma 21.519956 -104.893423 XXXXXX NAY5 Tepic, Nayarit (D) 11 Cucurbita argyrosperma subsp. argyrosperma 21.519956 -104.893423 XXXXXX 109 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 110 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 112 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 113 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 114 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 93 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 94 El Platanar, Sinaloa (feral) 12 feral individual 24.0303667 -106.432561 XXXXXX 95 Culiacán, Sinaloa (feral) 13 feral individual 24.817335 -107.416667 XXXXXX 96 Culiacán, Sinaloa (feral) 13 feral individual 24.817335 -107.416667 XXXXXX 107 Culiacán, Sinaloa (feral) 13 feral individual 24.817335 -107.416667 XXXXXX 108 Culiacán, Sinaloa (feral) 13 feral individual 24.817335 -107.416667 XXXXXX 121 Culiacán, Sinaloa (feral) 13 feral individual 24.817335 -107.416667 XXXXXX 115 Choix, Sinaloa (D) 14 Cucurbita argyrosperma subsp. argyrosperma 26.5967833 -108.335581 XXXXXX 116 Choix, Sinaloa (D) 14 Cucurbita argyrosperma subsp. argyrosperma 26.5967833 -108.335581 XXXXXX 117 Choix, Sinaloa (D) 14 Cucurbita argyrosperma subsp. argyrosperma 26.5967833 -108.335581 XXXXXX 118 Choix, Sinaloa (D) 14 Cucurbita argyrosperma subsp. argyrosperma 26.5967833 -108.335581 XXXXXX 119 Choix, Sinaloa (D) 14 Cucurbita argyrosperma subsp. argyrosperma 26.5967833 -108.335581 XXXXXX 138 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 139 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 140 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 141 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 142 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 143 Yecora, Sonora (D) 15 Cucurbita argyrosperma subsp. argyrosperma 28.3720417 -108.926986 XXXXXX 128 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 130 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 132 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 133 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 134 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 148 136 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX 137 Onavas, Sonora (feral) 16 feral individual 28.533333 -109.583333 XXXXXX DGO1 Durango, Durango (D) 17 Cucurbita argyrosperma subsp. argyrosperma 24.066667 -104.583333 XXXXXX TEH16 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH1 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH21 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH2 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH3_B07 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH3_E04 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX TEH4 Tehuantepec, Oaxaca (D) 18 Cucurbita argyrosperma subsp. argyrosperma 16.3328306 -95.2330361 XXXXXX 122 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX 123 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX 124 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX 125 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX 126 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX 127 Onavas, Sonora (D) 19 Cucurbita argyrosperma subsp. argyrosperma 28.533333 -109.583333 XXXXXX VER1 Tihuatlán, Veracruz (D) 20 Cucurbita argyrosperma subsp. argyrosperma 20.7200667 -97.5395028 XXXXXX VER2 Tihuatlán, Veracruz (D) 20 Cucurbita argyrosperma subsp. argyrosperma 20.7200667 -97.5395028 XXXXXX VER3 Tihuatlán, Veracruz (D) 20 Cucurbita argyrosperma subsp. argyrosperma 20.7200667 -97.5395028 XXXXXX VER4 Tihuatlán, Veracruz (D) 20 Cucurbita argyrosperma subsp. argyrosperma 20.7200667 -97.5395028 XXXXXX 120 Tihuatlán, Veracruz (D) 20 Cucurbita argyrosperma subsp. argyrosperma 20.7200667 -97.5395028 XXXXXX PAL1 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX PAL13 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX PAL20 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX PAL21 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX PAL3 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX PAL4 Palenque, Chiapas (D) 21 Cucurbita argyrosperma subsp. argyrosperma 17.5128139 -91.9877611 XXXXXX SLP1 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX SLP2 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX 149 SLP3 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX SLP4 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX SLP5 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX SLP6 Tanquián, San Luis Potosí (D) 22 Cucurbita argyrosperma subsp. argyrosperma 22.1154861 -101.009331 XXXXXX CHAMP1 Champotón, Campeche (D) 23 Cucurbita argyrosperma subsp. argyrosperma 19.5011556 -90.4613778 XXXXXX CHAMP2 Champotón, Campeche (D) 23 Cucurbita argyrosperma subsp. argyrosperma 19.5011556 -90.4613778 XXXXXX CHAMP3 Champotón, Campeche (D) 23 Cucurbita argyrosperma subsp. argyrosperma 19.5011556 -90.4613778 XXXXXX CHAMP4 Champotón, Campeche (D) 23 Cucurbita argyrosperma subsp. argyrosperma 19.5011556 -90.4613778 XXXXXX CHAMP5 Champotón, Campeche (D) 23 Cucurbita argyrosperma subsp. argyrosperma 19.5011556 -90.4613778 XXXXXX MIXT1 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX MIXT2 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX MIXT3 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX MIXT4 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX MIXT5 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX MIXT6 Mixtepec, Oaxaca (D) 24 Cucurbita argyrosperma subsp. argyrosperma 15.95875 -97.0849167 XXXXXX EK1 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX EK2 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX EK3 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX EK4 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX EK5 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX EK6 Ek Balam, Yucatan (D) 25 Cucurbita argyrosperma subsp. argyrosperma 20.9166667 -87.9166667 XXXXXX CHAN1 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX CHAN2 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX CHAN33 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX 150 CHAN36 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX CHAN3 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX CHANT4 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX CHANT5 Chan Santa Cruz, Quintana Roo (D) 26 Cucurbita argyrosperma subsp. argyrosperma 19.3670639 -88.3328917 XXXXXX 151 Table S3. Genetic diversity of each wild, domesticated and feral population of Cucurbita argyrosperma using 2,861 unlinked SNPs (r2 < 0.25, MAF > 1%). (N = sample size, HO = observed heterozygosity, HE = expected heterozygosity, π = nucleotide diversity, FIS = inbreeding coefficient, Var = variance) (within population names: W = wild, D = domesticated) Population name Population number N HO (Var) HE (Var) π (Var) FIS (Var) Cucurbita moschata (Outgroup) 0 5 0.077 (0.042) 0.058 (0.019) 0.068 (0.029) -0.018 (0.017) Jiquipilas, Chiapas (W) 1 8 0.091 (0.039) 0.073 (0.020) 0.082 (0.027) -0.020 (0.017) Ometepec, Guerrero (W) 2 5 0.089 (0.038) 0.076 (0.021) 0.088 (0.030) 0.001 (0.027) Puerto Escondido, Oaxaca (W) 3 14 0.094 (0.031) 0.079 (0.018) 0.085 (0.022) -0.021 (0.017) Jalisco (W) 4 17 0.115 (0.033) 0.099 (0.020) 0.106 (0.024) -0.020 (0.020) Tlapehuala, Guerrero (D) 5 10 0.098 (0.029) 0.086 (0.018) 0.093 (0.022) -0.010 (0.023) Jalisco (D) 6 13 0.099 (0.025) 0.090 (0.017) 0.096 (0.020) -0.006 (0.022) Badiraguato, Sinaloa (D) 7 1 0.095 (0.086) 0.047 (0.021) 0.095 (0.086) 0.000 (0.000) Matlalapa, Guerrero (D) 8 7 0.092 (0.029) 0.081 (0.018) 0.090 (0.023) -0.006 (0.017) Sahuayo, Michoacán (D) 9 10 0.101 (0.027) 0.091 (0.018) 0.097 (0.021) -0.008 (0.022) Salamanca, Guanajuato (D) 10 1 0.087 (0.079) 0.043 (0.020) 0.087 (0.079) 0.000 (0.000) Tepic, Nayarit (D) 11 4 0.089 (0.044) 0.067 (0.020) 0.083 (0.034) -0.012 (0.012) El Platanar, Sinaloa (feral) 12 4 0.107 (0.062) 0.074 (0.024) 0.097 (0.046) -0.019 (0.018) Culiacán, Sinaloa (feral) 13 3 0.099 (0.064) 0.066 (0.023) 0.094 (0.053) -0.010 (0.014) Choix, Sinaloa (D) 14 3 0.101 (0.060) 0.069 (0.023) 0.096 (0.050) -0.008 (0.014) Yecora, Sonora (D) 15 6 0.104 (0.040) 0.081 (0.020) 0.092 (0.027) -0.025 (0.013) Onavas, Sonora (feral) 16 7 0.101 (0.044) 0.077 (0.022) 0.088 (0.029) -0.026 (0.015) Durango, Durango (D) 17 1 0.076 (0.071) 0.038 (0.017) 0.076 (0.071) 0.000 (0.000) Tehuantepec, Oaxaca (D) 18 7 0.085 (0.032) 0.071 (0.018) 0.079 (0.023) -0.014 (0.014) Onavas, Sonora (D) 19 6 0.089 (0.040) 0.067 (0.018) 0.078 (0.026) -0.025 (0.013) Tihuatlán, Veracruz (D) 20 5 0.088 (0.041) 0.069 (0.020) 0.082 (0.030) -0.011 (0.018) Palenque, Chiapas (D) 21 5 0.089 (0.035) 0.072 (0.019) 0.083 (0.026) -0.013 (0.015) Tanquián, San Luis Potosí (D) 22 6 0.090 (0.033) 0.074 (0.019) 0.084 (0.025) -0.013 (0.013) Champotón, Campeche (D) 23 5 0.096 (0.039) 0.076 (0.020) 0.090 (0.030) -0.011 (0.016) Mixtepec, Oaxaca (D) 24 6 0.101 (0.038) 0.081 (0.020) 0.092 (0.027) -0.020 (0.016) Ek Balam, Yucatan (D) 25 6 0.098 (0.041) 0.075 (0.020) 0.086 (0.027) -0.026 (0.014) Chan Santa Cruz, Quintana Roo (D) 26 7 0.089 (0.032) 0.073 (0.018) 0.081 (0.022) -0.018 (0.016) 152 Table S4. Structural analysis performed 5,000 bp upstream and downstream of the 19 SNPs that were predicted as outliers by both BayeScEnv and PCAdapt. The structural analysis was performed by comparing the wild and domesticated reference genomes. CHR Candidate SNP position Candidate gene (domesticated) Candidate gene (wild) Gene annotation Description 1 1131161 Carg07327 Csor.00g19 6260 Uncharacterized protein We found some small indels on the intergenic DNA and the introns, most probably neutral. We also found three non-synonymous mutations within the uncharacterized protein. 1 3254988 Carg12374 Csor.00g19 2590 serine/threonine- protein kinase PBL10 We found several large deletions upstream of PBL10 (2000 bp upstream of the candidate SNP), possibly disrupting its transcriptional activity. The gene structure of PBL10 is well conserved between subspecies, with a few mutations within the introns, most likely neutral. Four substitutions were also found within the exons, two of which are non-synonymous. 1 4158310 Carg06982 Csor.00g20 3760 HSP17.8 17.8 kDa class I heat shock protein We detected two large indels downstream of HSP17.8 (4,700 bp from the candidate SNP) and a 3 bp insertion in the 5'UTR of HSP17.8. We also found several indels between HSP17.8 and an uncharacterized protein, one of whose size was 422 nt long, possibly disrupting an unknown regulatory sequence. 1 10918103 Carg_TCONS_0 0084066 NA lincRNA Several insertions/deletions are disrupting the lincRNAs around the candidate SNP. 3 4256305, 4256319 Carg19982 Csor.00g19 9760 UPF0503 protein At3g09070 (OCTOPUS) There are two nonsynonymous substitutions and a 12 nt insertion within the open reading frame of the OCTOPUS gene in C. argyrosperma subsp. sororia, (~2 kb away from the candidate SNPs). However, this insertion may be a derived state in C. argyrosperma subsp. sororia, as the genomes of C. argyrosperma subsp. argyrosperma, C. moschata, and C. pepo subsp. pepo all lack that same 12 nt insertion (Sun 2017, Montero-Pau 2018, Barrera-Redondo 2019). 5 5237485 TE TE NA The candidate SNP was localized within a transposable element. 4 16492565 Intergenic DNA Intergenic DNA NA We detected a 711 nt deletion within the domesticated genome, in the intergenic DNA between Rab7 and a long-noncoding RNA 6 9854562 Carg_TCONS_0 0051014 NA lincRNA We found 4 mutations between the wild and the domesticated genome in a predicted long noncoding RNA downstream of the candidate SNP (<100 bp away from the candidate SNP), suggesting it is the possible target of the selective sweep. 153 7 1584624 Carg07889 Csor.00g11 3710 Phosphatidylinositol/ phosphatidylcholine transfer protein SFH9 Candidate SNP located within an intron. Other variants are seen within the intron, but they don't disrupt the structure of the gene. A 29 nt deletion was detected within one of the introns (~1,800 bp upstream of candidate SNP). A 1 nt indel was found on the 5'UTR of SFH9. A 20 nt insertion was also found between SFH9 and KINB2 (1,000 bp downstream of SNP). 7 3024258 Carg18571 Csor.00g11 0800 AGP18 Lysine-rich arabinogalactan protein 18 There are two insertions and one deletion in the first exon of AGP18 in the domesticated genome, disrupting its open reading frame. 8 7179125 Carg15929 Csor.00g23 7260 uncharacterized protein At5g19025 No relevant differences between reference genomes. 9 4725402 Intergenic DNA Intergenic DNA NA We observed a 1,545 nt insertion within the domesticated genome, in an intergenic spacer between a long noncoding RNA and CPK9. 15 6673629 Carg26784 Csor.00g05 4230 Uncharacterized protein There is a 48 nt insertion in the domesticated genome within an intron of the uncharacterized protein. We also found a 63 nt deletion in the domesticated genome upstream of a small heat shock protein. 16 3211595 Carg22232 Csor.00g24 2160 Ribonucleases P/MRP protein subunit POP1 We found a 229 nt insertion in the domesticated genome upstream of POP1. 16 6190329 Carg24812 NA serine/threonine- protein kinase PBL23 There is a 915 nt insertion downstream of PBL23 in the domesticated genome. Both genomes become unalignable after 2500 nt upstream of the candidate SNP. 16 7224386, 7224386, 7224444 TE TE Possible uncharacterized protein Three candidate SNPs found in a row (within 100 bp). Many large and small rearrangements throughout the candidate region. No gene was found on either reference genome, but the site containing the candidate SNPs closely resembles an uncharacterized protein predicted in C. maxima (93.16% identity, 91% query cover), possibly a transposon protein. 154 Table S5. Cucurbita argyrosperma genes containing a candidate SNP (predicted by either BayeScEnv or PCAdapt) within their inner structure (introns, exons, UTRs). (AED = Annotation Edit Distance; GO ID = Gene Ontology ID) Gene ID (domesticated genome) Gene ID (wild genome) Chromosome location Functional annotation against SwissProt AED GO ID Carg00678 Csor.00g059490 Chr03 Similar to CSI1 Protein CELLULOSE SYNTHASE INTERACTIVE 1 (Arabidopsis thaliana) 0.06 GO:0005515 Carg00749 Csor.00g058770 Chr03 Similar to WDR5A COMPASS-like H3K4 histone methylase component WDR5A (Arabidopsis thaliana) 0.01 GO:0005515 Carg00754 Csor.00g058710 Chr03 Similar to TUBA Tubulin alpha chain (Prunus dulcis) 0.07 GO:0003924 Carg00755 NA Chr03 Protein of unknown function 0.16 Carg00942 Csor.00g292650 Chr06 Similar to CTN Cactin (Arabidopsis thaliana) 0.14 GO:0005515 Carg01177 Csor.00g247900 Chr06 Protein of unknown function 0.13 GO:0071816 Carg01302 Csor.00g249070 Chr06 Similar to SSL10 Protein STRICTOSIDINE SYNTHASE-LIKE 10 (Arabidopsis thaliana) 0.13 GO:0009058, GO:0016844 Carg01309 Csor.00g249150 Chr06 Similar to ARF1 Auxin response factor 1 (Arabidopsis thaliana) 0.15 Carg01823 Csor.00g164870 Chr04 Similar to PP2AA2 Serine/threonine-protein phosphatase 2A 65 kDa regulatory subunit A beta isoform (Arabidopsis thaliana) 0.19 GO:0005515 Carg02429 Csor.00g032070 Chr15 Protein of unknown function 0.11 Carg02490 Csor.00g220720 Chr02 Similar to FZR1 Protein FIZZY-RELATED 1 (Arabidopsis thaliana) 0.17 Carg02612 Csor.00g221980 Chr02 Similar to TIC100 Protein TIC 100 (Arabidopsis thaliana) 0.21 Carg02896 Csor.00g176320 Chr09 Similar to AKT1 Potassium channel AKT1 (Arabidopsis thaliana) 0.14 GO:0005216, GO:0006811, GO:0016020, GO:0055085 Carg02996 NA Chr09 Similar to At5g49980 Transport inhibitor response 1-like protein (Arabidopsis thaliana) 0.02 GO:0005515 155 Carg03224 Csor.00g101370 Chr20 Similar to DDB_G0284757 OTU domain- containing protein DDB_G0284757 (Dictyostelium discoideum) 0.31 Carg03669 NA Chr14 Similar to At3g10130 Heme-binding-like protein At3g10130, chloroplastic (Arabidopsis thaliana) 0.2 Carg03798 Csor.00g009600 Chr08 Similar to At1g09760 U2 small nuclear ribonucleoprotein A' (Arabidopsis thaliana) 0.06 Carg04098 Csor.00g278800 Chr19 Similar to GNTI Alpha-1,3-mannosyl-glycoprotein 2-beta-N-acetylglucosaminyltransferase (Arabidopsis thaliana) 0.23 GO:0006486, GO:0008375 Carg04189 Csor.00g196640 Chr01 Similar to RAD23B Ubiquitin receptor RAD23b (Arabidopsis thaliana) 0.16 GO:0003684, GO:0005515, GO:0005634, GO:0006289, GO:0043161 Carg04403 Csor.00g121120 Chr04 Similar to P4H7 Probable prolyl 4-hydroxylase 7 (Arabidopsis thaliana) 0.15 GO:0016491, GO:0055114 Carg04520 Csor.00g122340 Chr04 Similar to RBCMT Ribulose-1,5 bisphosphate carboxylase/oxygenase large subunit N- methyltransferase, chloroplastic (Nicotiana tabacum) 0.23 GO:0005515 Carg04587 Csor.00g123000 Chr04 Similar to At3g53190 Probable pectate lyase 12 (Arabidopsis thaliana) 0.06 Carg04908 Csor.00g009170 Chr08 Similar to RPS15AE 40S ribosomal protein S15a- 5 (Arabidopsis thaliana) 0.11 GO:0003735, GO:0005840, GO:0006412 Carg04919 Csor.00g009070 Chr08 Similar to U2AF65B Splicing factor U2af large subunit B (Nicotiana plumbaginifolia) 0.3 GO:0003676, GO:0003723, GO:0005634, GO:0006397 Carg04921 Csor.00g009020 Chr08 Similar to ATX2 Histone-lysine N- methyltransferase ATX2 (Arabidopsis thaliana) 0.23 GO:0005515, GO:0005634 156 Carg04961 Csor.00g008680 Chr08 Similar to CRK3 CDPK-related kinase 3 (Arabidopsis thaliana) 0.18 GO:0004672, GO:0005524, GO:0006468 Carg04969 Csor.00g008610 Chr08 Similar to CAT9 Cationic amino acid transporter 9, chloroplastic (Arabidopsis thaliana) 0.15 GO:0016020, GO:0022857, GO:0055085 Carg05198 Csor.00g110710 Chr07 Similar to MYB44 Transcription factor MYB44 (Arabidopsis thaliana) 0.01 GO:0003677 Carg05727 Csor.00g172100 Chr16 Similar to KIN7G Kinesin-like protein KIN-7G (Arabidopsis thaliana) 0.11 GO:0003777, GO:0005524, GO:0007018, GO:0008017 Carg05757 Csor.00g172420 Chr16 Similar to BGAL3 Beta-galactosidase 3 (Arabidopsis thaliana) 0.17 GO:0030246 Carg06107 Csor.00g118480 Chr04 Similar to OPR1 12-oxophytodienoate reductase 1 (Arabidopsis thaliana) 0.32 GO:0010181, GO:0016491, GO:0055114 Carg06628 Csor.00g046050 Chr18 Similar to NMT1 Phosphoethanolamine N- methyltransferase 1 (Arabidopsis thaliana) 0.2 GO:0008168 Carg06661 Csor.00g045750 Chr18 Protein of unknown function 0.15 Carg06696 Csor.00g045440 Chr18 Similar to At5g11010 Polynucleotide 5'-hydroxyl- kinase NOL9 (Arabidopsis thaliana) 0.19 Carg06792 Csor.00g044470 Chr18 Similar to MSP1 Protein MSP1 (Saccharomyces cerevisiae (strain ATCC 204508 / S288c)) 0.22 GO:0005515, GO:0005524 Carg06849 Csor.00g043850 Chr18 Similar to SS3 Soluble starch synthase 3, chloroplastic/amyloplastic (Solanum tuberosum) 0.11 GO:2001070 Carg06968 Csor.00g203640 Chr01 Similar to ORRM6 Organelle RRM domain- containing protein 6, chloroplastic (Arabidopsis thaliana) 0.18 GO:0003676 Carg06997 Csor.00g080170 Chr01 Similar to IREH1 Probable serine/threonine protein kinase IREH1 (Arabidopsis thaliana) 0.08 GO:0004672, GO:0005524, GO:0006468 Carg07232 Csor.00g265760 Chr17 Similar to HULK3 Protein HUA2-LIKE 3 (Arabidopsis thaliana) 0.22 157 Carg07327 Csor.00g196260 Chr01 Similar to dusA tRNA-dihydrouridine(20/20a) synthase (Vibrio vulnificus (strain CMCP6)) 0.27 GO:0008033, GO:0017150, GO:0050660, GO:0055114 Carg07674 Csor.00g064770 Chr13 Similar to apaG Protein ApaG (Magnetospirillum magneticum (strain AMB-1 / ATCC 700264)) 0.21 GO:0005515 Carg07889 Csor.00g113710 Chr07 Similar to SFH9 Phosphatidylinositol/phosphatidylcholine transfer protein SFH9 (Arabidopsis thaliana) 0.14 Carg07954 Csor.00g113070 Chr07 Similar to BHLH121 Transcription factor bHLH121 (Arabidopsis thaliana) 0.03 GO:0046983 Carg08549 Csor.00g041110 Chr02 Similar to At4g10930 Uncharacterized protein At4g10930 (Arabidopsis thaliana) 0.19 Carg09452 Csor.00g084200 Chr17 Similar to ALA4 Probable phospholipid- transporting ATPase 4 (Arabidopsis thaliana) 0.09 GO:0000166, GO:0000287, GO:0005524, GO:0015914, GO:0016021, GO:0140326 Carg09511 Csor.00g083590 Chr17 Similar to IAA27 Auxin-responsive protein IAA27 (Arabidopsis thaliana) 0.14 Carg10718 Csor.00g080500 Chr01 Similar to FBL15 F-box/LRR-repeat protein 15 (Arabidopsis thaliana) 0.16 GO:0005515 Carg10909 Csor.00g085670 Chr05 Protein of unknown function 0.04 Carg10970 Csor.00g236380 Chr08 Similar to trc Serine/threonine-protein kinase tricorner (Drosophila pseudoobscura pseudoobscura) 0.14 GO:0004672, GO:0004674, GO:0005524, GO:0006468 Carg11153 Csor.00g150050 Chr14 Protein of unknown function 0.07 Carg11621 Csor.00g072250 Chr02 Similar to Zeaxanthin epoxidase, chloroplastic (Prunus armeniaca) 0.12 GO:0005515, GO:0009507, GO:0009540, GO:0009688, GO:0016020, 158 GO:0055114, GO:0071949 Carg11936 Csor.00g166690 Chr05 Similar to PUB6 U-box domain-containing protein 6 (Arabidopsis thaliana) 0.1 GO:0004842, GO:0016567 Carg12108 Csor.00g006010 Chr05 Similar to ATAD1 ATPase family AAA domain- containing protein 1 (Bos taurus) 0.17 GO:0005524 Carg12374 Csor.00g192590 Chr01 Similar to PBL10 Probable serine/threonine- protein kinase PBL10 (Arabidopsis thaliana) 0.19 GO:0004672, GO:0006468 Carg12525 Csor.00g094660 Chr14 Similar to EMB1444 Transcription factor EMB1444 (Arabidopsis thaliana) 0.24 GO:0046983 Carg12589 Csor.00g095250 Chr14 Similar to AP2 Floral homeotic protein APETALA 2 (Arabidopsis thaliana) 0.07 GO:0003677, GO:0003700, GO:0006355 Carg12845 Csor.00g037030 Chr09 Similar to GDPDL4 Glycerophosphodiester phosphodiesterase GDPDL4 (Arabidopsis thaliana) 0.1 GO:0006629, GO:0008081 Carg13010 Csor.00g214890 Chr10 Similar to At4g29530 Thiamine phosphate phosphatase-like protein (Arabidopsis thaliana) 0.11 GO:0016791 Carg13344 Csor.00g042780 Chr07 Similar to CSTF77 Cleavage stimulation factor subunit 77 (Arabidopsis thaliana) 0.23 GO:0005515, GO:0005634, GO:0006397 Carg13413 Csor.00g005190 Chr08 Similar to Sacs Sacsin (Mus musculus) 0.06 Carg13432 Csor.00g005000 Chr08 Similar to efr3b Protein EFR3 homolog B (Danio rerio) 0.16 Carg13651 Csor.00g132700 Chr02 Similar to Unc45a Protein unc-45 homolog A (Mus musculus) 0.12 GO:0005515 Carg14212 Csor.00g024080 Chr04 Similar to SCAI Protein SCAI (Homo sapiens) 0.08 GO:0003714, GO:0006351 Carg14376 Csor.00g056580 Chr04 Similar to CLE10 CLAVATA3/ESR (CLE)-related protein 10 (Arabidopsis thaliana) 0.22 Carg14512 Csor.00g157750 Chr18 Similar to MKP1 Protein-tyrosine-phosphatase MKP1 (Arabidopsis thaliana) 0.01 GO:0008138, GO:0016311 Carg14536 Csor.00g207860 Chr06 Similar to At5g03900 Uncharacterized protein At5g03900, chloroplastic (Arabidopsis thaliana) 0.18 159 Carg14932 Csor.00g116960 Chr07 Similar to VPS54 Vacuolar protein sorting- associated protein 54, chloroplastic (Arabidopsis thaliana) 0.26 GO:0005515, GO:0008080, GO:0042147 Carg14976 Csor.00g116510 Chr07 Similar to ABCC2 ABC transporter C family member 2 (Arabidopsis thaliana) 0.88 GO:0005524, GO:0016021, GO:0016887, GO:0042626, GO:0055085 Carg15060 Csor.00g282910 Chr16 Similar to IQD1 Protein IQ-DOMAIN 1 (Arabidopsis thaliana) 0.09 GO:0005515 Carg15210 Csor.00g217960 Chr10 Protein of unknown function 0.33 Carg15512 Csor.00g251220 Chr06 Similar to SEC23 Protein transport protein SEC23 (Ustilago maydis (strain 521 / FGSC 9021)) 0.14 GO:0006886, GO:0006888, GO:0008270, GO:0030127 Carg15691 Csor.00g281360 Chr19 Similar to Os03g0733400 Zinc finger BED domain-containing protein RICESLEEPER 2 (Oryza sativa subsp. japonica) 0.11 GO:0003677, GO:0046983 Carg15904 Csor.00g236940 Chr08 Similar to SGS3 Protein SUPPRESSOR OF GENE SILENCING 3 homolog (Oryza sativa subsp. indica) 0.13 GO:0031047 Carg15929 Csor.00g237260 Chr08 Similar to At5g19025 Uncharacterized protein At5g19025 (Arabidopsis thaliana) 0.28 Carg16055 Csor.00g112640 Chr07 Similar to AUL1 Auxilin-like protein 1 (Arabidopsis thaliana) 0.04 Carg16433 NA Chr12 Similar to Gtp-bp Signal recognition particle receptor subunit alpha homolog (Drosophila melanogaster) 0.08 GO:0003924, GO:0005047, GO:0005525, GO:0005785, GO:0006614, GO:0006886 Carg17189 Csor.00g139890 Chr01 Similar to FPP4 Filament-like plant protein 4 (Arabidopsis thaliana) 0.11 160 Carg17222 Csor.00g304480 Chr18 Similar to CUT1 3-ketoacyl-CoA synthase 6 (Arabidopsis thaliana) 0.02 GO:0003824, GO:0006633, GO:0016020, GO:0016747 Carg17814 Csor.00g228870 Chr10 Protein of unknown function 0.21 Carg17827 Csor.00g228730 Chr10 Similar to BSL2 Serine/threonine-protein phosphatase BSL2 (Arabidopsis thaliana) 0.1 GO:0004721, GO:0005515, GO:0009742, GO:0016787 Carg18146 Csor.00g156040 Chr11 Protein of unknown function 0.13 Carg18171 Csor.00g211980 Chr11 Similar to AHA11 ATPase 11, plasma membrane- type (Arabidopsis thaliana) 0.11 GO:0008553, GO:0016021, GO:0120029 Carg18484 Csor.00g111760 Chr07 Similar to At1g04910 Uncharacterized protein At1g04910 (Arabidopsis thaliana) 0.25 Carg18727 Csor.00g105010 Chr17 Similar to serinc Probable serine incorporator (Nematostella vectensis) 0.07 GO:0016020 Carg18786 Csor.00g231070 Chr08 Similar to CURT1C Protein CURVATURE THYLAKOID 1C, chloroplastic (Arabidopsis thaliana) 0.26 Carg18944 Csor.00g084780 Chr17 Similar to ATL3 RING-H2 finger protein ATL3 (Arabidopsis thaliana) 0.01 Carg19818 Csor.00g017820 Chr05 Similar to SPCC1672.07 U3 small nucleolar RNA- associated protein 21 homolog (Schizosaccharomyces pombe (strain 972 / ATCC 24843)) 0.18 GO:0005515, GO:0006364, GO:0032040 Carg20078 Csor.00g227200 Chr09 Similar to ABCE2 ABC transporter E family member 2 (Arabidopsis thaliana) 0.12 GO:0005524, GO:0016887 Carg20623 Csor.00g016670 Chr13 Protein of unknown function 0.03 GO:0005515 Carg20889 Csor.00g273290 Chr06 Protein of unknown function 0.18 Carg21397 Csor.00g161930 Chr13 Similar to AGD12 ADP-ribosylation factor GTPase-activating protein AGD12 (Arabidopsis thaliana) 0.16 GO:0005096 161 Carg21521 Csor.00g160720 Chr13 Similar to SNI1 Negative regulator of systemic acquired resistance SNI1 (Arabidopsis thaliana) 0.19 Carg21706 Csor.00g203020 Chr01 Similar to DGK1 Diacylglycerol kinase 1 (Arabidopsis thaliana) 0.06 GO:0004143, GO:0007205, GO:0016301, GO:0035556 Carg22034 Csor.00g277150 Chr19 Similar to PDV2 Plastid division protein PDV2 (Arabidopsis thaliana) 0.12 Carg22092 Csor.00g304880 Chr18 Similar to Arfrp1 ADP-ribosylation factor-related protein 1 (Rattus norvegicus) 0.09 GO:0005525 Carg22232 Csor.00g242160 Chr16 Similar to POP1 Ribonucleases P/MRP protein subunit POP1 (Homo sapiens) 0.1 Carg22875 Csor.00g267800 Chr15 Similar to SPBC3E7.09 Uncharacterized protein slp1 (Schizosaccharomyces pombe (strain 972 / ATCC 24843)) 0.11 Carg22878 Csor.00g267780 Chr15 Similar to PTD Protein PARTING DANCERS (Arabidopsis thaliana) 0.32 Carg22996 Csor.00g086420 Chr05 Similar to FLK Flowering locus K homology domain (Arabidopsis thaliana) 0.11 GO:0003723 Carg23167 Csor.00g003720 Chr08 Similar to CES101 G-type lectin S-receptor-like serine/threonine-protein kinase CES101 (Arabidopsis thaliana) 0.04 GO:0004672, GO:0004674, GO:0005524, GO:0006468 Carg23235 Csor.00g188450 Chr16 Similar to Lon protease homolog 2, peroxisomal (Spinacia oleracea) 0.12 GO:0004176, GO:0004252, GO:0005524, GO:0006508 Carg23389 Csor.00g206240 Chr06 Similar to nop12 Nucleolar protein 12 (Schizosaccharomyces pombe (strain 972 / ATCC 24843)) 0.08 GO:0003676 Carg23772 Csor.00g220140 Chr04 Similar to SAC3A SAC3 family protein A (Arabidopsis thaliana) 0.14 162 Carg23802 NA Chr14 Similar to At1g06840 Probable LRR receptor-like serine/threonine-protein kinase At1g06840 (Arabidopsis thaliana) 0.19 GO:0005515 Carg24347 Csor.00g079020 Chr07 Similar to ITN1 Ankyrin repeat-containing protein ITN1 (Arabidopsis thaliana) 0.03 Carg24693 Csor.00g076650 Chr01 Similar to At1g17220 Translation initiation factor IF-2, chloroplastic (Arabidopsis thaliana) 0.08 GO:0003743, GO:0003924, GO:0005525, GO:0006413 Carg24812 NA Chr16 Similar to PBL23 Probable serine/threonine- protein kinase PBL23 (Arabidopsis thaliana) 0.07 GO:0004672, GO:0005524, GO:0006468 Carg24979 Csor.00g267730 Chr15 Similar to Cag_1601 UPF0301 protein Cag_1601 (Chlorobium chlorochromatii (strain CaD3)) 0.29 Carg25109 NA Chr13 Protein of unknown function 0.23 Carg25229 Csor.00g030470 Chr16 Protein of unknown function 0.05 Carg25230 Csor.00g030480 Chr16 Similar to CLC-F Chloride channel protein CLC-f (Arabidopsis thaliana) 0.05 GO:0005247, GO:0006821, GO:0016020, GO:0055085 Carg25231 Csor.00g030490 Chr16 Similar to WRKY2 Probable WRKY transcription factor 2 (Arabidopsis thaliana) 0.03 GO:0003700, GO:0006355, GO:0043565 Carg25337 Csor.00g070030 Chr19 Similar to Glycerol-3-phosphate acyltransferase, chloroplastic (Cucumis sativus) 0.17 GO:0004366, GO:0006650, GO:0016746 Carg25546 Csor.00g002770 Chr17 Similar to TMKL1 Putative kinase-like protein TMKL1 (Arabidopsis thaliana) 0.3 GO:0004672, GO:0005515, GO:0006468 Carg25626 Csor.00g028980 Chr01 Similar to ASP3 Aspartate aminotransferase 3, chloroplastic (Arabidopsis thaliana) 0.14 GO:0009058, GO:0030170 Carg25639 Csor.00g029150 Chr01 Similar to OVA7 Serine--tRNA ligase, chloroplastic/mitochondrial (Arabidopsis thaliana) 0.12 GO:0000166, GO:0004812, GO:0004828, 163 GO:0005524, GO:0006418, GO:0006434 Carg26216 Csor.00g010650 Chr15 Similar to FTSH11 ATP-dependent zinc metalloprotease FTSH 11, chloroplastic/mitochondrial (Arabidopsis thaliana) 0.27 GO:0004222, GO:0005524, GO:0006508, GO:0016020 Carg26378 Csor.00g260430 Chr11 Similar to DLO1 Protein DMR6-LIKE OXYGENASE 1 (Arabidopsis thaliana) 0.16 GO:0016491, GO:0055114 Carg26784 NA Chr15 Protein of unknown function 0.02 Carg26826 Csor.00g030720 Chr16 Similar to PA200 Proteasome activator subunit 4 (Arabidopsis thaliana) 0.14 Carg26857 Csor.00g260990 Chr11 Similar to DDB_G0292320 Protein unc-50 homolog (Dictyostelium discoideum) 0.24 Carg26907 Csor.00g220120 Chr04 Similar to maea Macrophage erythroblast attacher (Danio rerio) 0.14 Carg27113 Csor.00g014920 Chr19 Similar to VPS13C Vacuolar protein sorting- associated protein 13C (Homo sapiens) 0.11 Carg27299 Csor.00g267650 Chr15 Similar to SAC1 Phosphoinositide phosphatase SAC1 (Arabidopsis thaliana) 0.18 GO:0042578 Carg27622 Csor.00g103900 Chr20 Similar to SUMO2 Small ubiquitin-related modifier 2 (Arabidopsis thaliana) 0.32 164 Table S6. Significantly enriched (p < 0.05) Gene Ontology terms for the 125 genes with candidate SNPs within their structure (introns, exons, UTRs). Biological Process GO ID Term p-value GO:0009688 Abscisic acid biosynthetic process 0.0055 GO:0006434 Seryl-tRNA aminoacylation 0.0111 GO:0071816 Tail-anchored membrane protein insertion into ER membrane 0.0111 GO:0009742 Brassinosteroid mediated signaling pathway 0.0274 GO:0042147 Retrograde transport, endosome to Golgi 0.0328 GO:0043161 Proteasome-mediated ubiquitin-dependent protein catabolic process 0.0328 GO:0007205 Protein kinase C-activating G protein-coupled receptor signaling pathway 0.0382 Molecular Function GO ID Term p-value GO:0009540 Zeaxanthin epoxidase [overall] activity 0.0059 GO:0004828 Serine-tRNA ligase activity 0.0117 GO:0004366 Glycerol-3-phosphate O-acyltransferase activity 0.0117 GO:0004176 ATP-dependent peptidase activity 0.0117 GO:0005515 Protein binding 0.0148 GO:0003714 Transcription corepressor activity 0.0233 GO:0005047 Signal recognition particle binding 0.0291 GO:0004674 Protein serine/threonine kinase activity 0.0324 GO:0017150 tRNA dihydrouridine synthase activity 0.0348 GO:0004143 Diacylglycerol kinase activity 0.0405 GO:0016844 Strictosidine synthase activity 0.0405 GO:2001070 Starch binding 0.0405 Cellular Component GO ID Term p-value GO:0005785 Signal recognition particle receptor complex 0.011 GO:0032040 Small-subunit processome 0.043 165 Table S7. Cucurbitacin-related gene orthologs of C. argyrosperma identified by a bidirectional BLAST against the genes reported by Shang et al. (2014) on Cucumis sativus. (AED = Annotation Edit Distance; GO ID = Gene Ontology ID) Gene ID (domesticated genome) Gene ID (wild genome) Chromosome location Functional annotation against SwissProt AED GO ID Carg02215 Csor.00g312480 Chr15 Similar to 5MAT Malonyl-CoA:anthocyanidin 5-O- glucoside-6''-O-malonyltransferase (Arabidopsis thaliana) 0.11 GO:0016747 Carg02488 Csor.00g220700 Chr02 Similar to rpoB DNA-directed RNA polymerase subunit beta (Cucumis sativus) 0.08 GO:0003677, GO:0003899, GO:0006351, GO:0046983 Carg03795 NA Chr08 Similar to CYP81E1 Isoflavone 2'-hydroxylase (Glycyrrhiza echinata) 0.01 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg03796 Csor.00g009580 Chr08 Similar to BAHD1 BAHD acyltransferase At5g47980 (Arabidopsis thaliana) 0.02 GO:0016747 Carg03797 Csor.00g009590 Chr08 Similar to CYP87A3 Cytochrome P450 87A3 (Oryza sativa subsp. japonica) 0.14 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg06672 Csor.00g045650 Chr18 Similar to CYP87A3 Cytochrome P450 87A3 (Oryza sativa subsp. japonica) 0.1 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg07313 Csor.00g196390 Chr01 Similar to CYP705A5 Cytochrome P450 705A5 (Arabidopsis thaliana) 0.12 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg08824 Csor.00g156780 Chr02 Similar to CYP88D6 Beta-amyrin 11-oxidase (Glycyrrhiza uralensis) 0.03 GO:0005506, GO:0016705, 166 GO:0020037, GO:0055114 Carg11550 Csor.00g001510 Chr17 Similar to CYP89A9 Cytochrome P450 89A9 (Arabidopsis thaliana) 0 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg11551 NA Chr17 Similar to CYP81E8 Cytochrome P450 81E8 (Medicago truncatula) 0.1 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg11552 Csor.00g001500 Chr17 Similar to CPQ Cucurbitadienol synthase (Cucurbita pepo) 0.16 GO:0016866 Carg14872 Csor.00g181690 Chr14 Similar to KAO1 Ent-kaurenoic acid oxidase 1 (Arabidopsis thaliana) 0.07 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg15241 Csor.00g217650 Chr10 Similar to CPR NADPH--cytochrome P450 reductase (Catharanthus roseus) 0.18 GO:0003958, GO:0010181, GO:0016491, GO:0055114 Carg15423 Csor.00g250380 Chr06 Similar to CYP51G1 Sterol 14-demethylase (Arabidopsis thaliana) 0.03 GO:0005506, GO:0016705, GO:0020037, GO:0055114 Carg16672 Csor.00g308950 Chr15 Similar to BHLH120 Transcription factor bHLH120 (Arabidopsis thaliana) 0.14 GO:0046983 167 Supplemental Figures Figure S1. Synteny dot plots between the chromosome-level genome assemblies of Cucurbita argyrosperma subsp. argyrosperma and Cucurbita argyrosperma subsp. sororia against the reference genomes of Cucurbita moschata and Cucurbita maxima (Sun et al., 2017). 168 Figure S2. Synteny dot plots for chromosome 1 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata, (D) C. moschata and C. maxima. We used C. moschata and C. maxima as outgroups to identify the evolutionary orientation of the observed differences. We found a putative ~1,000,000 nt deletion in chromosome 1 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. We also found a putative inversion in chromosome 1 of C. moschata that occurred after diverging from C. argyrosperma. 169 Figure S3. Synteny dot plots for chromosome 2 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found two putative deletions and one insertion in chromosome 2 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. 170 Figure S4. Synteny dot plots for chromosome 3 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a huge deletion (>1,000,000 nt) at the end of chromosome 3 in C. argyrosperma subsp. sororia, possibly due to an artifact during chromosome anchoring, as this sequence was found alongside the centromere of chromosome 15. 171 Figure S5. Synteny dot plots for chromosome 4 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a putative insertion (> 1,000,000 nt) in chromosome 4 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. 172 Figure S6. Synteny dot plots for chromosome 5 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found three insertions in chromosome 5 of C. argyrosperma subsp. argyrosperma, including one that is ~800,000 nt long, that occurred after diverging from C. argyrosperma subsp. sororia. 173 Figure S7. Synteny dot plots for chromosome 6 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a collapsed centromere assembly in C. argyrosperma subsp. argyrosperma, as well as a ~200,000 nt putative deletion that occurred after diverging from C. argyrosperma subsp. sororia. 174 Figure S8. Synteny dot plots for chromosome 7 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We recovered a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. We found a putative insertion (~400,000 nt) and a putative inversion (~200,000 nt) in chromosome 7 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. We found a candidate SNP within the ~200,000 nt inversion, making this putative rearrangement a possible candidate for selection during domestication. 175 Figure S9. Synteny dot plots for chromosome 8 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a larger centromere in C. argyrosperma subsp. sororia, as well as other repetitive region upstream, possibly due to a better assembly of the repetitive regions. We also found one putative deletion and one putative insertion in chromosome 8 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. 176 Figure S10. Synteny dot plots for chromosome 9 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found two putative insertions in the chromosome 9 of C. argyrosperma subsp. argyrosperma and one putative insertion in the chromosome 9 of C. argyrosperma subsp. sororia. 177 Figure S11. Synteny dot plots for chromosome 10 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a shorter centromere in the assembly of chromosome 10 of C. argyrosperma subsp. argyrosperma, possibly due to a better assembly of the repetitive regions in the other two genomes. 178 Figure S12. Synteny dot plots for chromosome 11 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. We also found three putative deletions in chromosome 11 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. 179 Figure S13. Synteny dot plots for chromosome 12 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. We also found five putative insertions in chromosome 12 of C. argyrosperma subsp. argyrosperma that occurred after diverging from C. argyrosperma subsp. sororia. 180 Figure S14. Synteny dot plots for chromosome 13 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. 181 Figure S15. Synteny dot plots for chromosome 14 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a ~1,000,000 nt putative insertion in chromosome 14 of C. argyrosperma subsp. sororia. 182 Figure S16. Synteny dot plots for chromosome 15 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected an insertion of a segment of chromosome 3 near the centromere of chromosome 15 in C. argyrosperma subsp. sororia, possibly due to an artifact during chromosome anchoring. 183 Figure S17. Synteny dot plots for chromosome 16 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found one large putative insertion (~700,000 nt) in C. argyrosperma subsp. argyrosperma between the bases 3,300,000 and 4,000,000 of chromosome 16. We found two smaller putative deletions in C. argyrosperma subsp. argyrosperma corresponding to the positions ~1,200,000 and ~7,500,000 of chromosome 16. 184 Figure S18. Synteny dot plots for chromosome 17 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a shorter centromere in the assembly of chromosome 17 of C. argyrosperma subsp. argyrosperma. 185 Figure S19. Synteny dot plots for chromosome 18 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We found a ~600,000 nt putative insertion near the centromere of C. argyrosperma subsp. sororia. 186 Figure S20. Synteny dot plots for chromosome 19 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. 187 Figure S21. Synteny dot plots for chromosome 20 between (A) C. argyrosperma subsp. argyrosperma and C. argyrosperma subsp. sororia, (B) C. argyrosperma subsp. sororia and C. moschata, (C) C. argyrosperma subsp. argyrosperma and C. moschata. We used C. moschata as outgroup to identify the evolutionary orientation of the observed differences. We detected a larger centromere in C. argyrosperma subsp. sororia, possibly due to a better assembly of the repetitive regions. 188 Figure S22. LD decay graph between the 10,617 SNPs used to perform the selective scans in Cucurbita argyrosperma. 189 Figure S23. Synteny plot between Chromosomes 3 and 7 of Cucurbita argyrosperma. The syntenic location between the two long-noncoding RNAs under positive selection during domestication in chromosome 3 (Carg_TCONS00015730 and Carg_TCONS_00016456) and its protein-coding homolog in chromosome 7 (RMD5) is highlighted in red, as well as the syntenic location of a previously identified long-noncoding RNA that is potentially under purifying selection in chromosome 3 (Carg_TCONS_00015392) and its protein-coding homolog in chromosome 7 (TRAPPC11).